Abstract Replication is a standard technique for fault tolerance in
distributed systems modeled as deterministic finite state machines
(DFSMs or machines). To correct $f$ crash or $\lfloor f/2 \rfloor$
Byzantine faults among $n$ different machines, replication requires
$nf$ additional backup machines. We present a solution called fusion
that requires just $f$ additional backup machines. First, we build a
framework for fault tolerance in DFSMs based on the notion of Hamming
distances. We introduce the concept of an $(f, m)$-fusion, which is a
set of $m$ backup machines that can correct $f$ crash faults or
$\lfloor f/2 \rfloor$ Byzantine faults among a given set of machines.
Second, we present an algorithm to generate an $(f, f)$-fusion for a
given set of machines. We ensure that our backups are efficient in
terms of the size of their state and event sets. Third, we use
locality sensitive hashing for the detection and correction of faults,
which incurs almost the same overhead as replication. We detect
Byzantine faults with time complexity $O(nf)$ on average, while we
correct crash and Byzantine faults with time complexity $O(n \rho f)$
with high probability, where $\rho$ is the average state reduction
achieved by fusion. Finally, our evaluation of fusion on the widely
used MCNC'91 benchmarks for DFSMs shows that the average state space
savings in fusion (over replication) are 38% (range 0-99%). To
demonstrate the practical use of fusion, we describe its potential
application to the MapReduce framework. Using a simple case study, we
compare replication and fusion as applied to this framework. While a
pure replication-based solution requires 1.8 million map tasks, our
fusion-based solution requires only 1.4 million map tasks with minimal
overhead during normal operation or recovery. Hence, fusion results in
considerable savings in state space and other resources such as the
power needed to run the backup tasks.

Keywords Distributed Systems, Fault Tolerance, Finite 
State Machines, Coding Theory, Hamming Distances. 



1 Introduction 

Distributed applications often use deterministic finite state
machines (referred to as DFSMs or machines) to model computations
such as regular expressions for pattern detection, syntactical
analysis of documents or mining algorithms for large data sets. These
machines, executing on distinct distributed processes, are often prone
to faults. Traditional solutions to this problem involve some form of
replication. To correct $f$ crash faults [25] among $n$ given machines
(referred to as primaries), $f$ copies of each primary are maintained
[17, 28, 26]. If the backups start from the same initial state as the
corresponding primaries and act on the same events, then in the case
of faults, the state of the failed machines can be recovered from one
of the remaining copies. These backups can also correct
$\lfloor f/2 \rfloor$ Byzantine faults [18], where the processes lie
about the state of the machine, since a majority of truthful machines
is always available. This approach, requiring $nf$ total backups, is
expensive both in terms of the state space of the backups and other
resources such as the power needed to run these backups.
[Figure 1, panels: (i) Primaries, (ii) RCP: Inefficient Backup, (iii) State Efficient Backup. Surviving labels: A (Parity of 0s, 2s); $F_1$ (2 states, 1 event).]

Fig. 1 Correcting one crash fault among $\{A, B, C\}$ using just one additional backup rather than the three backups required by replication.



Consider a distributed application that is searching for three
different string patterns in a file. These string patterns or regular
expressions are usually modeled as DFSMs. Consider the state machines
A, B and C shown in Fig. 1. A state machine in our system consists of
a finite set of states and a finite set of events. On application of
an event, the state machine transitions to the next state based on the
state-transition function. For example, machine A in Fig. 1 contains
the states $\{a^0, a^1\}$, events $\{0, 2\}$ and the initial state,
shown by the dark ended arrow, is $a^0$. The state transitions are
shown by the arrows from one state to another. Hence, if A is in state
$a^0$ and event 0 is applied to it, then it transitions to state
$a^1$. In this example, A checks the parity of $\{0, 2\}$ and so, if
it is in state $a^0$, then an even number of 0s or 2s have been
applied to the machine and if it is in state $a^1$, then an odd number
of the inputs have been applied. Machines B and C check for the parity
of $\{1, 2\}$ and $\{0\}$ respectively.

To correct one crash fault among these machines, replication requires
a copy of each of them, resulting in three backup machines, consuming
a total state space of eight ($2^3$). Another way of looking at
replication in DFSMs is by constructing a backup machine that is the
reachable cross product or RCP (formally defined in section 3.1) of
the original machines. As shown in Fig. 1, each state of the RCP,
denoted by R, is a tuple in which the elements correspond to the
states of A, B and C respectively. Let each of the machines A, B, C
and R start from their initial state. If some event sequence
(generated by the client/environment) $0 \to 2 \to 1$ is applied on
these machines, then the states of R, A, B and C are
$r^6 = \{a^0 b^0 c^1\}$, $a^0$, $b^0$ and $c^1$ respectively. Here,
even if one of the primaries crashes, using the state of R, we can
determine the state of the crashed primary. Hence, the RCP is a valid
backup machine.

However, using the RCP of the primaries as a backup has two major
disadvantages: (i) Given $n$ primaries each containing $O(s)$ states,
the number of states in the RCP is $O(s^n)$, which is exponential in
the number of primaries. In Fig. 1, R has eight states. (ii) The event
set of the RCP is the union of the event sets of the primaries. In
Fig. 1, while A, B and C have only two, two and one event respectively
in their event sets, R has three events. This translates to increased
load on the backup. Can we generate backup machines that are more
efficient than the RCP in terms of states and events?

Consider $F_1$ shown in Fig. 1. If the event sequence
$0 \to 0 \to 1 \to 2$ is applied to the machines A, B, C and $F_1$,
then they will be in states $a^1$, $b^0$, $c^0$ and $f_1^1$. Assume a
crash fault in C. Given the parity of 1s (state of $F_1$) and the
parity of 1s or 2s (state of B), we can first determine the parity of
2s. Using this, and the parity of 0s or 2s (state of A), we can
determine the parity of 0s (state of C). Hence, we can determine the
state of C as $c^0$ using the states of A, B and $F_1$. This argument
can be extended to correcting one fault among any of the machines in
$\{A, B, C, F_1\}$. This approach consumes fewer backups than
replication (one vs. three), fewer states than the RCP (two states vs.
eight states) and fewer events than the RCP (one event vs. three
events). How can we generate such a backup for any arbitrary set of
machines? In Fig. 1, can $F_1$ and $F_2$ correct two crash faults
among the primaries? Further, how do we correct the faults? In this
paper, we address such questions through the following contributions:



Framework for Fault Tolerance in DFSMs We explore the idea of a fault
graph and use that to define the minimum Hamming distance [13] for a
set of machines. Using this framework, we can specify the exact number
of crash or Byzantine faults a set of machines can correct. Further,
we introduce the concept of an $(f, m)$-fusion, which is a set of $m$
machines that can correct $f$ crash faults, detect $f$ Byzantine
faults or correct $\lfloor f/2 \rfloor$ Byzantine faults. We refer to
the machines as fusions or fused backups. In Fig. 1, $F_1$ and $F_2$
can correct two crash faults among $\{A, B, C\}$ and hence
$\{F_1, F_2\}$ is a (2, 2)-fusion of $\{A, B, C\}$. Replication is
just a special case of $(f, m)$-fusion where $m = nf$. We prove
properties of the $(f, m)$-fusion for a given set of primary machines,
including lower bounds for the existence of such fusions.

Algorithm to Generate Fused Backup Machines Given a set of $n$
primaries, we present an algorithm that generates an $(f, f)$-fusion
corresponding to them, i.e., we generate a set of $f$ backup machines
that can correct $f$ crash or $\lfloor f/2 \rfloor$ Byzantine faults
among them. We show that our backups are efficient in terms of: (i)
the number of states in each backup, (ii) the number of events in each
backup, (iii) the minimality (defined in section 3.4) of the entire
set of backups in terms of states. Further, we show that if our
algorithm does not achieve state and event reduction, then no solution
with the same number of backups achieves it. Our algorithm has time
complexity polynomial in $N$, where $N$ is the number of states in the
RCP of the primaries. We present an incremental approach to this
algorithm that improves the time complexity by a factor of
$O(\rho^n)$, where $\rho$ is the average state savings achieved by
fusion.

Detection and Correction of Faults We present a Byzantine detection
algorithm with time complexity $O(nf)$ on average, which is the same
as the time complexity of detection for replication. Hence, for a
system that needs to periodically detect liars, fusion causes no
additional overhead. We reduce the problem of fault correction to one
of finding points within a certain Hamming distance of a given query
point in $n$-dimensional space and present algorithms to correct crash
and Byzantine faults with time complexity $O(n \rho f)$ with high
probability (w.h.p). The time complexity for crash and Byzantine
correction in replication is $O(f)$ and $O(nf)$ respectively. Hence,
for small values of $n$ and $\rho$, fusion causes almost no overhead
for recovery. Table 1 describes the main symbols used in this paper,
while Table 2 summarizes the main results in the paper through a
comparison with replication.

Fusion-based Grep in the MapReduce Framework To illustrate the
practical use of fusion, we consider its potential application to the
grep functionality of the MapReduce framework [8]. The MapReduce
framework is a prevalent solution to model large scale distributed
computations. The grep functionality is used in many applications that
need to identify patterns in huge textual data, such as data mining,
machine learning and query log analysis. Using a simple case study, we
show that a pure replication-based approach for fault tolerance needs
1.8 million map tasks while our fusion-based solution requires only
1.4 million map tasks. Further, we show that our approach causes
minimal overhead during normal operation or recovery.

Fusion-based Design Tool and Experimental Evaluation We provide a Java
design tool based on our fusion algorithm that takes a set of input
machines and generates fused backup machines corresponding to them. We
evaluate our fusion algorithm on the MCNC'91 benchmarks for DFSMs,
which are widely used in the fields of logic synthesis and circuit
design. Our results show that the average state space savings in
fusion (over replication) are 38% (range 0-99%), while the average
event reduction is 4% (range 0-45%). Further, the average savings in
time by the incremental approach for generating the fusions (over the
non-incremental approach) is 8%.

In section 2, we specify the system model and assumptions of our work.
In section 3, we describe the theory of our backup or fusion machines.
Following this, we present algorithms to generate these fusion
machines in section 4. In section 5, we present the algorithms for the
detection and correction of faults in a system with primary and fusion
machines. Sections 6 and 7 deal with the practical aspects and
experimental evaluation of fusion. In section 8, we consider potential
solutions to this problem, outside the framework of this paper.
Section 9 covers the related work in this area. Finally, we summarize
our work and discuss future extensions in section 10.

2 Model 

The DFSMs in our system execute on separate distributed 
processes. We assume loss-less FIFO communication links 
with a strict upper bound on the time taken for message de- 
livery. Clients of the state machines issue the events (or com- 
mands) to the concerned primaries and backups. For simplic- 
ity, we assume that there is a single client issuing the events 
to the machines. This along with FIFO links ensures that all 
machines act on the events in the same relative order. This 
can be extended to multiple clients using standard total order
broadcast mechanisms present in the literature [19, 20].

The execution state of a machine is the current state in 
which it is executing. Faults in our system are of two types: 
crash faults, resulting in a loss of the execution state of the 
machines and Byzantine faults resulting in an arbitrary ex- 
ecution state. We assume that the given set of primary ma- 
chines cannot correct a single crash fault amongst themselves. 
When faults are detected by a trusted recovery agent using 
timeouts (crash faults) or a detection algorithm (Byzantine 
faults) no further events are sent by any client to these ma- 
chines. Assuming the machines have acted on the same se- 
quence of events, the recovery agent obtains their states, and 
recovers the correct execution states of all faulty machines. 

3 Framework for Fault Tolerance in DFSMs 

In this section, we describe the framework with which we can specify
the exact number of crash or Byzantine faults that any set of machines
can correct. Further, we introduce the concept of an $(f, m)$-fusion
for a set of primaries, which is a set of machines that can correct
$f$ crash faults, detect $f$ Byzantine faults and correct
$\lfloor f/2 \rfloor$ Byzantine faults.

Table 1 Symbols/Notation used in the paper

  Symbol          Meaning
  --------------  ------------------------------------------
  $\mathcal{P}$   Set of primaries
  $n$             Number of primaries
  RCP             Reachable Cross Product
  $N$             Number of states in the RCP
  $f$             No. of crash faults
  $s$             Maximum number of states among primaries
  $\mathcal{F}$   Set of fusions/backups
  $\rho$          Average State Reduction in fusion
  $\Sigma$        Union of primary event-sets
  $\beta$         Average Event Reduction in fusion

Table 2 Replication vs. Fusion (Columns 2 and 3 for $f$ crash faults, 4 and 5 for $f$ Byzantine faults)

                                     | Rep-Crash  | Fusion-Crash                 | Rep-Byz    | Fusion-Byz
  ---------------------------------- | ---------- | ---------------------------- | ---------- | ----------------------------
  Number of Backups                  | $nf$       | $f$                          | $2nf$      | $2f$
  Backup State Space                 | $s^{nf}$   | $(s^n/\rho)^f$               | $s^{2nf}$  | $(s^n/\rho)^{2f}$
  Average Events/Backup              | $|\Sigma|/n$ | $|\Sigma|/\beta$           | $|\Sigma|/n$ | $|\Sigma|/\beta$
  Fault Detection Time               | $O(1)$     | $O(1)$                       | $O(nf)$    | $O(nf)$ (on avg.)
  Fault Correction Time              | $O(f)$     | $O(n \rho f)$ w.h.p          | $O(nf)$    | $O(n \rho f)$ w.h.p
  Fault Detection Messages           | $O(1)$     | $O(1)$                       | $2nf$      | $n + f$
  Fault Correction Messages          | $f$        | $n$                          | $n + 2f$   | $n + f$
  Backup Generation Time Complexity  | $O(nsf)$   | $O(s^n |\Sigma| f / \rho^n)$ | $O(nsf)$   | $O(s^n |\Sigma| f / \rho^n)$



3.1 DFSMs and their Reachable Cross Product 

A DFSM, denoted by A, consists of a set of states $X_A$, a set of
events $\Sigma_A$, a transition function
$\alpha_A : X_A \times \Sigma_A \to X_A$ and an initial state $a^0$.
The size of A, denoted by $|A|$, is the number of states in $X_A$. A
state $s \in X_A$ is reachable iff there exists a sequence of events
which, when applied on the initial state $a^0$, takes the machine to
state $s$. Consider any two machines,
$A(X_A, \Sigma_A, \alpha_A, a^0)$ and
$B(X_B, \Sigma_B, \alpha_B, b^0)$. Now construct another machine which
consists of all the states in the product set of $X_A$ and $X_B$ with
the transition function
$\alpha'(\{a, b\}, \sigma) = \{\alpha_A(a, \sigma), \alpha_B(b, \sigma)\}$
for all $\{a, b\} \in X_A \times X_B$ and
$\sigma \in \Sigma_A \cup \Sigma_B$, where a machine is taken to stay
in its current state on an event outside its own event set. This
machine
$(X_A \times X_B, \Sigma_A \cup \Sigma_B, \alpha', \{a^0, b^0\})$ may
have states that are not reachable from the initial state
$\{a^0, b^0\}$. If all such unreachable states are pruned, we get the
reachable cross product of A and B. In Fig. 1, R is the reachable
cross product of A, B and C. Throughout the paper, when we just say
RCP, we refer to the reachable cross product of the set of primary
machines. Given a set of primaries, the number of states in its RCP is
denoted by $N$ and its event set, which is the union of the event sets
of the primaries, is denoted by $\Sigma$.
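
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the parity machines follow the description of A, B and C in section 1, while the 0/1 state encoding and the names `parity_machine`, `reachable_cross_product` and `run` are assumptions made for this sketch.

```python
from collections import deque

# A primary is (initial state, event set, transition function).
# States 0/1 encode even/odd parity; this encoding is an assumption.
def parity_machine(events):
    return (0, frozenset(events), lambda state, ev: state ^ 1)

def reachable_cross_product(machines):
    """Breadth-first search over tuples of primary states."""
    all_events = sorted(set().union(*(evs for _, evs, _ in machines)))
    start = tuple(init for init, _, _ in machines)

    def step(state, ev):
        # A machine moves only on events in its own event set.
        return tuple(
            delta(s, ev) if ev in evs else s
            for s, (_, evs, delta) in zip(state, machines)
        )

    seen = {start}
    queue = deque([start])
    transitions = {}
    while queue:
        state = queue.popleft()
        for ev in all_events:
            nxt = step(state, ev)
            transitions[(state, ev)] = nxt
            if nxt not in seen:       # unreachable tuples are never added
                seen.add(nxt)
                queue.append(nxt)
    return start, all_events, seen, transitions

A = parity_machine({0, 2})
B = parity_machine({1, 2})
C = parity_machine({0})
start, events, states, delta = reachable_cross_product([A, B, C])

def run(seq):
    """State of the RCP after applying the event sequence seq."""
    state = start
    for ev in seq:
        state = delta[(state, ev)]
    return state
```

For these three primaries the search visits eight states over three events, matching the sizes quoted for R, and `run([0, 2, 1])` returns the tuple for $a^0 b^0 c^1$.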

As seen in section 1, given the state of the RCP, we can determine the
state of each of the primary machines and vice versa. However, the RCP
has states exponential in $n$ and an event set that is the union of
all primary event sets. Can we generate machines that contain fewer
states and events than the RCP? In the following section, we first
define the notion of order and the 'less than or equal to' ($\le$)
relation among machines.



3.2 Order Among Machines and their Closed Partition 
Lattice 

Consider a DFSM, $A = (X_A, \Sigma, \alpha_A, x_A^0)$. A partition P
on the state set $X_A$ of A is a set $\{B_1, \ldots, B_k\}$ of
disjoint subsets of the state set $X_A$, such that
$\bigcup_{i=1}^{k} B_i = X_A$ and $B_i \cap B_j = \emptyset$ for
$i \ne j$ [19]. An element $B_i$ of a partition is called a block. A
partition P is said to be closed if each event $\sigma \in \Sigma$
maps a block of P into another block. A closed partition P corresponds
to a distinct machine. Given any machine A, we can partition its state
space such that the transition function $\alpha_A$ maps each block of
the partition to another block for all events in $\Sigma_A$ [14, 19].

In other words, we combine the states of A to generate machines that
are consistent with the transition function. We refer to the set of
all such closed partitions as the closed partition set of A. In this
paper, we discuss the closed partitions corresponding to the RCP of
the primaries. In Fig. 2, we show the closed partition set of the RCP
of $\{A, B, C\}$ (labeled R). Consider machine $M_2$ in Fig. 2,
generated by combining the states $r^0$ and $r^2$ of R. Note that, on
event 1, $r^0$ transitions to $r^1$ and $r^2$ transitions to $r^3$.
Hence, we need to combine the states $r^1$ and $r^3$. Continuing this
procedure, we obtain all the combined states of this machine. Hence,
we have reduced the RCP to generate a smaller machine. By combining
different pairs of states and by further reducing the machines thus
formed, we can construct the entire closed partition set of R.

We can define an order ($\le$) among any two machines P and Q in this
set as follows: $P \le Q$ if each block of Q is contained in a block
of P (shown by an arrow from P to Q). Intuitively, given the state of
Q we can determine the state of P. Machines P and Q are incomparable,
i.e., $P \| Q$, if $P \not\le Q$ and $Q \not\le P$. In Fig. 2,
$F_3 \le M_2$, while $M_1 \| M_2$. It can be seen that the set of all
closed partitions corresponding to a machine forms a lattice under the
$\le$ relation [14]. We saw in section 3.1 that given the state of the
primaries, we can determine the state of the RCP and vice versa.
Hence, the primary machines are always part of the closed partition
set of the RCP (see A, B and C in Fig. 2).
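
As a small illustration of this order, the check below tests whether every block of one partition lies inside a block of another. It is a sketch under assumptions: RCP states are parity tuples $(a, b, c)$, and $F_1$ is modeled as the parity of 1s as described in section 1, which in this encoding works out to $a \oplus b \oplus c$ (a derivation made for this sketch, not taken from Fig. 2). The helper names are hypothetical.

```python
from itertools import product

STATES = list(product((0, 1), repeat=3))   # RCP states as (a, b, c)

def partition(key):
    """Group RCP states by the state a machine would be in."""
    blocks = {}
    for s in STATES:
        blocks.setdefault(key(s), set()).add(s)
    return [frozenset(b) for b in blocks.values()]

def leq(p, q):
    """P <= Q: every block of Q lies inside some block of P."""
    return all(any(bq <= bp for bp in p) for bq in q)

def incomparable(p, q):
    return not leq(p, q) and not leq(q, p)

R = partition(lambda s: s)                     # one block per RCP state
A = partition(lambda s: s[0])
B = partition(lambda s: s[1])
F1 = partition(lambda s: s[0] ^ s[1] ^ s[2])   # parity of 1s (assumed model)
```

With these definitions `leq(A, R)` and `leq(F1, R)` hold, while A and B, and A and $F_1$, are incomparable.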

Among the machines shown in Fig. 2, some, like $F_2$ (4 states, 3
events), have reduced states, while some, like $M_1$ (4 states, 2
events) and $F_1$ (2 states, 1 event), have both reduced states and
events as compared to R (8 states, 3 events). Which among these
machines can act as backups? In the following section, we describe the
concept of fault graphs and their Hamming distances to answer this
question.

3.3 Fault Graphs and Hamming Distances 

We begin with the idea of a fault graph of a set of machines
$\mathcal{M}$, for a machine T, where all machines in $\mathcal{M}$
are less than or equal to T. This is a weighted graph and is denoted
by $G(T, \mathcal{M})$. The fault graph is an indicator of the
capability of the set of machines in $\mathcal{M}$ to correctly
identify the current state of T. As described in the previous section,
since all the machines in $\mathcal{M}$ are less than or equal to T,
the set of states of any machine in $\mathcal{M}$ corresponds to a
closed partition of the set of states of T. Hence, given the state of
T, we can determine the state of all the machines in $\mathcal{M}$ and
vice versa.

Definition 1 (Fault Graph) Given a set of machines $\mathcal{M}$ and
a machine $T(X_T, \Sigma_T, \alpha_T, t^0)$ such that
$\forall M \in \mathcal{M} : M \le T$, the fault graph
$G(T, \mathcal{M})$ is a fully connected weighted graph where,

- Every node of the graph corresponds to a state in $X_T$

- The weight of the edge $(t^i, t^j)$ between two nodes, where
$t^i, t^j \in X_T$, is the number of machines in $\mathcal{M}$ that
have states $t^i$ and $t^j$ in distinct blocks

We construct the fault graph $G(R, \{A\})$, referring to Fig. 2. A has
two states, $a^0 = \{r^0, r^1, r^5, r^6\}$ and
$a^1 = \{r^2, r^3, r^4, r^7\}$. Given just the current state of A, it
is possible to tell whether R is in state $r^0$ or $r^2$ (exact), but
not whether it is in $r^0$ or $r^1$ (ambiguity). Here, A distinguishes
between $(r^0, r^2)$ but not between $(r^0, r^1)$. Hence, in the fault
graph $G(R, \{A\})$ in Fig. 3(i), the edge $(r^0, r^2)$ has weight
one, while $(r^0, r^1)$ has weight zero. A machine
$M \in \mathcal{M}$ is said to cover an edge $(t^i, t^j)$ if $t^i$ and
$t^j$ lie in separate blocks of M, i.e., M separates the states $t^i$
and $t^j$. In Fig. 2, A covers $(r^0, r^2)$. In Figs. 9 and 10 of the
Appendix, we show an example of the closed partition set and fault
graphs for a different set of primaries.

Given the states of $|\mathcal{M}| - x$ machines in $\mathcal{M}$, it
is always possible to determine if T is in state $t^i$ or $t^j$ iff
the weight of the edge $(t^i, t^j)$ is greater than $x$. Consider the
graph shown in Fig. 3(ii). Given the state of any two machines in
$\{A, B, C\}$, we can determine if R is in state $r^0$ or $r^2$, since
the weight of that edge is greater than one, but cannot do the same
for the edge $(r^0, r^1)$, since the weight of the edge is one. In
coding theory [7, 24], the concept of Hamming distance [13] is widely
used to specify the fault tolerance of an erasure code. If an erasure
code has minimum Hamming distance greater than $d$, then it can
correct $d$ erasures or $\lfloor d/2 \rfloor$ errors. To understand
the fault tolerance of a set of machines, we define a similar notion
of distances for the fault graph.

Definition 2 (distance) Given a set of machines $\mathcal{M}$ and
their reachable cross product $T(X_T, \Sigma_T, \alpha_T, t^0)$, the
distance between any two states $t^i, t^j \in X_T$, denoted by
$d(t^i, t^j)$, is the weight of the edge $(t^i, t^j)$ in the fault
graph $G(T, \mathcal{M})$. The least distance in $G(T, \mathcal{M})$
is denoted by $d_{min}(T, \mathcal{M})$.

[Figure 3, panels: (i) $G(\{A\})$, (ii) $G(\{A, B, C\})$, (iii) $G(\{A, B, C, R\})$, (iv) $G(\{A, B, C, F_1\})$, (v) $G(\{A, B, C, F_1, F_2\})$.]

Fig. 3 Fault Graphs, $G(R, \mathcal{M})$, for sets of machines shown in Fig. 2. For notational convenience, we just label the graphs with $G(\mathcal{M})$. All eight nodes $r^0$-$r^7$ with their edges have not been shown due to space constraints.

Given a fault graph $G(T, \mathcal{M})$, the smallest distance between
the nodes in the fault graph specifies the fault tolerance of
$\mathcal{M}$. Consider the graph $G(R, \{A, B, C, F_1, F_2\})$ shown
in Fig. 3(v). Since the smallest distance in the graph is three, we
can remove any two machines from $\{A, B, C, F_1, F_2\}$ and still
regenerate the current state of R. As seen before, given the state of
R, we can determine the state of any machine less than R. Therefore,
the set of machines $\{A, B, C, F_1, F_2\}$ can correct two crash
faults.
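
The edge weights and the least distance can be computed by brute force over all pairs of RCP states, as in the sketch below. Several things here are assumptions of the sketch: RCP states are parity tuples $(a, b, c)$, $F_1$ is modeled as the parity of 1s ($a \oplus b \oplus c$ in this encoding), and `F2` is a hypothetical four-state machine chosen only to be consistent with the sizes quoted for $F_2$ (4 states, 3 events); the actual $F_2$ of Fig. 2 may differ.

```python
from itertools import combinations, product

STATES = list(product((0, 1), repeat=3))   # RCP states as (a, b, c)

# Each machine is the map from an RCP state to the machine's own state.
MACHINES = {
    "A": lambda s: s[0],
    "B": lambda s: s[1],
    "C": lambda s: s[2],
    "R": lambda s: s,
    "F1": lambda s: s[0] ^ s[1] ^ s[2],          # parity of 1s (assumed)
    "F2": lambda s: (s[0] ^ s[1], s[1] ^ s[2]),  # hypothetical stand-in
}

def edge_weight(names, x, y):
    """Number of machines that place x and y in distinct blocks."""
    return sum(1 for n in names if MACHINES[n](x) != MACHINES[n](y))

def d_min(names):
    """Least edge weight in the fault graph of the named machines."""
    return min(edge_weight(names, x, y)
               for x, y in combinations(STATES, 2))

def crash_faults_tolerated(names):
    return d_min(names) - 1
```

Under these assumptions `d_min(["A", "B", "C"])` is 1, adding `"F1"` raises it to 2, and adding both `"F1"` and `"F2"` raises it to 3, in line with the distances quoted in the text.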

Theorem 1 A set of machines $\mathcal{M}$ can correct up to $f$ crash
faults iff $d_{min}(T, \mathcal{M}) > f$, where T is the reachable
cross-product of all machines in $\mathcal{M}$.

Proof ($\Rightarrow$) Given that $d_{min}(T, \mathcal{M}) > f$, we
show that any $|\mathcal{M}| - f$ machines from $\mathcal{M}$ can
accurately determine the current state of T, thereby recovering the
state of the crashed machines. Since $d_{min}(T, \mathcal{M}) > f$, by
definition, at least $f + 1$ machines separate any two states of
$X_T$. Hence, for any pair of states $t^i, t^j \in X_T$, even after
$f$ crash failures in $\mathcal{M}$, at least one machine remains that
can distinguish between $t^i$ and $t^j$. This implies that it is
possible to accurately determine the current state of T by using any
$|\mathcal{M}| - f$ machines from $\mathcal{M}$.

($\Leftarrow$) Given that $d_{min}(T, \mathcal{M}) \le f$, we show
that the system cannot correct $f$ crash faults. The condition
$d_{min}(T, \mathcal{M}) \le f$ implies that there exist states $t^i$
and $t^j$ in $G(T, \mathcal{M})$ separated by distance $k$, where
$k \le f$. Hence there exist exactly $k$ machines in $\mathcal{M}$
that can distinguish between states $t^i, t^j \in X_T$. Assume that
all these $k$ machines crash (since $k \le f$) when T is in either
$t^i$ or $t^j$. Using the states of the remaining machines in
$\mathcal{M}$, it is not possible to determine whether T was in state
$t^i$ or $t^j$. Therefore, it is not possible to exactly regenerate
the state of any machine in $\mathcal{M}$ using the remaining
machines.
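
To make the recovery step concrete, the sketch below scans the RCP states for those consistent with the surviving machines. It is a brute-force illustration, not the locality-sensitive-hashing algorithm of the paper; the tuple encoding of RCP states, the model of $F_1$ as $a \oplus b \oplus c$, and the names `consistent_states` and `recover_all` are assumptions.

```python
from itertools import product

STATES = list(product((0, 1), repeat=3))   # RCP states as (a, b, c)

MACHINES = {
    "A": lambda s: s[0],
    "B": lambda s: s[1],
    "C": lambda s: s[2],
    "F1": lambda s: s[0] ^ s[1] ^ s[2],    # parity of 1s (assumed model)
}

def consistent_states(observed):
    """RCP states that agree with every surviving machine's state."""
    return [s for s in STATES
            if all(MACHINES[n](s) == v for n, v in observed.items())]

def recover_all(observed):
    """States of all machines, if the survivors pin down the RCP state."""
    candidates = consistent_states(observed)
    if len(candidates) != 1:
        raise ValueError("survivors do not determine the RCP state")
    (state,) = candidates
    return {n: m(state) for n, m in MACHINES.items()}

# After the sequence 0, 0, 1, 2 the survivors report a^1, b^0, f_1^1.
recovered = recover_all({"A": 1, "B": 0, "F1": 1})
```

With C crashed, the three survivors leave a single candidate and `recovered["C"]` is 0, i.e. $c^0$. If a second machine also crashes, two candidates remain and the function raises, which is the behaviour Theorem 1 predicts for $d_{min} = 2$.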

Byzantine faults may include machines which lie about their state.
Consider the machines $\{A, B, C, F_1, F_2\}$ shown in Fig. 2, whose
fault graph appears in Fig. 3(v). Let the execution states of the
machines A, B, C, $F_1$ and $F_2$ be

$a^0 = \{r^0, r^1, r^5, r^6\}$, $b^1 = \{\ldots, r^5\}$, $c^0 = \{r^0, \ldots\}$,

$f_1^0 = \{r^0, \ldots\}$, $f_2^0 = \{r^0, \ldots\}$,

respectively. Since $r^0$ appears four times (a majority) among these
states, even if there is one liar we can determine that R is in state
$r^0$. But if R is in state $r^0$, then B must have been in state
$b^0$, which contains $r^0$. So clearly, B is lying and its correct
state is $b^0$. Here, we can determine the correct state of the liar,
since $d_{min}(R, \{A, B, C, F_1, F_2\}) = 3$, and the majority of
machines distinguish between all pairs of states.
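
The voting argument can be illustrated with a brute-force sketch that counts, for each RCP state, how many reported blocks contain it. The same caveats apply as elsewhere: RCP states are encoded as parity tuples $(a, b, c)$, $F_1$ is modeled as $a \oplus b \oplus c$, `F2` is a hypothetical four-state stand-in for $F_2$, and the exhaustive scan is a simplification, not the paper's correction algorithm.

```python
from itertools import product

STATES = list(product((0, 1), repeat=3))   # RCP states as (a, b, c)

MACHINES = {
    "A": lambda s: s[0],
    "B": lambda s: s[1],
    "C": lambda s: s[2],
    "F1": lambda s: s[0] ^ s[1] ^ s[2],          # parity of 1s (assumed)
    "F2": lambda s: (s[0] ^ s[1], s[1] ^ s[2]),  # hypothetical stand-in
}

def correct_byzantine(reported):
    """Pick the RCP state contained in the most reported blocks."""
    def votes(s):
        return sum(1 for n, v in reported.items() if MACHINES[n](s) == v)

    best = max(STATES, key=votes)
    liars = sorted(n for n, v in reported.items() if MACHINES[n](best) != v)
    return best, liars

truth = (0, 0, 0)
reported = {n: m(truth) for n, m in MACHINES.items()}
reported["B"] = 1                      # B lies about its state
state, liars = correct_byzantine(reported)
```

With one liar among five machines and least distance three, the true state collects four votes while any other state collects at most three, so the scan returns the true state and names B as the liar.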



Theorem 2 A set of machines $\mathcal{M}$ can correct up to $f$
Byzantine faults iff $d_{min}(T, \mathcal{M}) > 2f$, where T is the
reachable cross-product of all machines in $\mathcal{M}$.

Proof ($\Rightarrow$) Given that $d_{min}(T, \mathcal{M}) > 2f$, we
show that any $|\mathcal{M}| - f$ correct machines from $\mathcal{M}$
can accurately determine the current state of T in spite of $f$ liars.
Since $d_{min}(T, \mathcal{M}) > 2f$, at least $2f + 1$ machines
separate any two states of $X_T$. Hence, for any pair of states
$t^i, t^j \in X_T$, after $f$ Byzantine failures in $\mathcal{M}$,
there will always be at least $f + 1$ correct machines that can
distinguish between $t^i$ and $t^j$. This implies that it is possible
to accurately determine the current state of T by simply taking a
majority vote.

($\Leftarrow$) Given that $d_{min}(T, \mathcal{M}) \le 2f$, we show
that the system cannot correct $f$ Byzantine faults.
$d_{min}(T, \mathcal{M}) \le 2f$ implies that there exist states
$t^i, t^j \in X_T$ separated by distance $k$, where $k \le 2f$. If $f$
among these $k$ machines lie about their state, we have only $k - f$
correct machines remaining. Since $k - f \le f$, it is impossible to
distinguish the liars from the truthful machines and regenerate the
correct state of T.

In this paper, we are concerned only with the fault graph of machines
w.r.t. the RCP of the primaries $\mathcal{P}$. For notational
convenience, we use $G(\mathcal{M})$ instead of $G(RCP, \mathcal{M})$
and $d_{min}(\mathcal{M})$ instead of $d_{min}(RCP, \mathcal{M})$.
From theorems 1 and 2, it is clear that a set of $n$ machines
$\mathcal{P}$ can correct $(d_{min}(\mathcal{P}) - 1)$ crash faults
and $\lfloor (d_{min}(\mathcal{P}) - 1)/2 \rfloor$ Byzantine faults.
Henceforth, we only consider backup machines less than or equal to the
RCP of the primaries. In the following section, we describe the theory
of such backup machines.

3.4 Theory of $(f, m)$-fusion

To correct faults in a given set of machines, we need to add backup
machines so that the fault tolerance of the system (original set of
machines along with the backups) increases to the desired value. To
simplify the discussion, in the remainder of this paper, unless
specified otherwise, we mean crash faults when we simply say faults.
Given a set of $n$ machines $\mathcal{P}$, we add $m$ backup machines
$\mathcal{F}$, each less than or equal to the RCP, such that the set
of machines in $\mathcal{P} \cup \mathcal{F}$ can correct $f$ faults.
We call the set of $m$ machines in $\mathcal{F}$ an $(f, m)$-fusion of
$\mathcal{P}$. From theorem 1, we know that
$d_{min}(\mathcal{P} \cup \mathcal{F}) > f$.

Definition 3 (Fusion) Given a set of $n$ machines $\mathcal{P}$, we
refer to the set of $m$ machines $\mathcal{F}$ as an $(f, m)$-fusion
of $\mathcal{P}$ if $d_{min}(\mathcal{P} \cup \mathcal{F}) > f$.

Any machine belonging to $\mathcal{F}$ is referred to as a fused
backup or just a fusion. Consider the set of machines
$\mathcal{P} = \{A, B, C\}$ shown in Fig. 1. From Fig. 3(ii),
$d_{min}(\{A, B, C\}) = 1$. Hence the set of machines $\mathcal{P}$
cannot correct a single fault. To generate a set of machines
$\mathcal{F}$ such that $\mathcal{P} \cup \mathcal{F}$ can correct two
faults, consider Fig. 3(v). Since
$d_{min}(\{A, B, C, F_1, F_2\}) = 3$, $\{A, B, C, F_1, F_2\}$ can
correct two faults. Hence, $\{F_1, F_2\}$ is a (2, 2)-fusion of
$\{A, B, C\}$. Note that the set of machines in
$\{A, A, B, B, C, C\}$, i.e., replication, is a (2, 6)-fusion of
$\{A, B, C\}$.

Any machine in the set $\{A, B, C, F_1, F_2\}$ can at most contribute
a value of one to the weight of any edge in the graph
$G(\{A, B, C, F_1, F_2\})$. Hence, even if we remove one of the
machines, say $F_2$, from this set, $d_{min}(\{A, B, C, F_1\})$ is
greater than one. So $\{F_1\}$ is a (1, 1)-fusion of $\{A, B, C\}$.

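Theorem 3 (Subset of a Fusion) Given a set of $n$ machines
$\mathcal{P}$ and an $(f, m)$-fusion $\mathcal{F}$ corresponding to
it, any subset $\mathcal{F}' \subseteq \mathcal{F}$ such that
$|\mathcal{F}'| = m - t$ is an $(f - t, m - t)$-fusion when
$t \le \min(f, m)$.

Proof Since $\mathcal{F}$ is an $(f, m)$-fusion of $\mathcal{P}$,
$d_{min}(\mathcal{P} \cup \mathcal{F}) > f$. Any machine
$F \in \mathcal{F}$ can at most contribute a value of one to the
weight of any edge of the graph $G(\mathcal{P} \cup \mathcal{F})$.
Therefore, even if we remove $t$ machines from the set of machines in
$\mathcal{F}$, the least distance of the remaining set stays greater
than $f - t$. Hence, for any subset
$\mathcal{F}' \subseteq \mathcal{F}$ of size $m - t$,
$d_{min}(\mathcal{P} \cup \mathcal{F}') > f - t$. This implies that
$\mathcal{F}'$ is an $(f - t, m - t)$-fusion of $\mathcal{P}$.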

It is important to note that the converse of this theorem is not true.
In Fig. 2, while $\{M_2\}$ and $\{F_1\}$ are (1, 1)-fusions of
$\{A, B, C\}$, since $d_{min}(\{A, B, C, M_2, F_1\}) = 2$,
$\{M_2, F_1\}$ is not a (2, 2)-fusion of $\{A, B, C\}$. We now
consider the existence of an $(f, m)$-fusion for a given set of
machines $\mathcal{P}$. Consider the existence of a (2, 1)-fusion for
$\{A, B, C\}$ in Fig. 2. From Fig. 3(ii), $d_{min}(\{A, B, C\}) = 1$.
Clearly, R covers every edge in the fault graph. Even if we add R to
this set, from Fig. 3(iii), $d_{min}(\{A, B, C, R\}) < 3$. Hence,
there cannot exist a (2, 1)-fusion for $\{A, B, C\}$.

Theorem 4 (Existence of Fusions) Given a set of $n$ machines
$\mathcal{P}$, there exists an $(f, m)$-fusion of $\mathcal{P}$ iff
$m + d_{min}(\mathcal{P}) > f$.

Proof ($\Rightarrow$) Assume that there exists an $(f, m)$-fusion
$\mathcal{F}$ for the given set of machines $\mathcal{P}$. Since
$\mathcal{F}$ is an $(f, m)$-fusion of $\mathcal{P}$,
$d_{min}(\mathcal{P} \cup \mathcal{F}) > f$. The $m$ machines in
$\mathcal{F}$ can at most contribute a value of $m$ to the weight of
each edge in $G(\mathcal{P} \cup \mathcal{F})$. Hence,
$m + d_{min}(\mathcal{P})$ has to be greater than $f$.

($\Leftarrow$) Assume that $m + d_{min}(\mathcal{P}) > f$. Consider a
set of $m$ machines $\mathcal{F}$ containing $m$ copies of the RCP.
These copies contribute exactly $m$ to the weight of each edge in
$G(\mathcal{P} \cup \mathcal{F})$. Since
$d_{min}(\mathcal{P}) > f - m$,
$d_{min}(\mathcal{P} \cup \mathcal{F}) > f$. Hence, $\mathcal{F}$ is
an $(f, m)$-fusion of $\mathcal{P}$.
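
As a quick check of the existence condition, the arithmetic below applies $m + d_{min}(\mathcal{P}) > f$ to the running example, using only the value $d_{min}(\{A, B, C\}) = 1$ quoted from Fig. 3(ii) and the fusions already named in the text.

```latex
\begin{align*}
(1,1)\text{-fusion:} \quad & m + d_{min}(\mathcal{P}) = 1 + 1 = 2 > 1
  && \text{(exists, e.g. } \{F_1\}\text{)} \\
(2,1)\text{-fusion:} \quad & m + d_{min}(\mathcal{P}) = 1 + 1 = 2 \not> 2
  && \text{(does not exist)} \\
(2,2)\text{-fusion:} \quad & m + d_{min}(\mathcal{P}) = 2 + 1 = 3 > 2
  && \text{(exists, e.g. } \{F_1, F_2\}\text{)}
\end{align*}
```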



Given a set of machines, we now define an order among
$(f, m)$-fusions corresponding to them.

Definition 4 (Order among $(f, m)$-fusions) Given a set of $n$
machines $\mathcal{P}$, an $(f, m)$-fusion
$\mathcal{F} = \{F_1, \ldots, F_m\}$ is less than another
$(f, m)$-fusion $\mathcal{G}$, i.e., $\mathcal{F} < \mathcal{G}$, iff
the machines in $\mathcal{G}$ can be ordered as
$\{G_1, G_2, \ldots, G_m\}$ such that
$\forall 1 \le i \le m : (F_i \le G_i) \wedge (\exists j : F_j < G_j)$.

An $(f, m)$-fusion $\mathcal{F}$ is minimal if there exists no
$(f, m)$-fusion $\mathcal{F}'$ such that
$\mathcal{F}' < \mathcal{F}$. It can be seen that

$d_{min}(\{A, B, C, M_2, F_2\}) = 3$,

and hence $\mathcal{F}' = \{M_2, F_2\}$ is a (2, 2)-fusion of
$\{A, B, C\}$. We have seen that $\mathcal{F} = \{F_1, F_2\}$ is a
(2, 2)-fusion of $\{A, B, C\}$. From Fig. 2, since $F_1 < M_2$,
$\mathcal{F} < \mathcal{F}'$. In Fig. 2, since $R_\perp$ cannot be a
fusion for $\{A, B, C\}$, there exists no (2, 2)-fusion less than
$\{F_1, F_2\}$. Hence, $\{F_1, F_2\}$ is a minimal (2, 2)-fusion of
$\{A, B, C\}$.

We now prove a property of the fusion machines that is crucial for
practical applications. Consider a set of primaries $\mathcal{P}$ and
an $(f, m)$-fusion $\mathcal{F}$ corresponding to it. The client sends
updates addressed to the primaries to all the backups as well. We show
that events or inputs that belong to distinct sets of primaries can be
received in any order at each of the fused backups. This eliminates
the need for synchrony at the backups.

Consider a fusion $F \in \mathcal{F}$. Since the states of F are
essentially partitions of the state set of the RCP, the state
transitions of F are defined by the state transitions of the RCP. For
example, machine $M_1$ in Fig. 2 transitions from $\{r^0, r^2\}$ to
$\{r^1, r^3\}$ on event 1, because $r^0$ and $r^2$ transition to $r^1$
and $r^3$ respectively on event 1. Hence, if we show that the state of
the RCP is independent of the order in which it receives events
addressed to different primaries, then the same applies to the
fusions.

Theorem 5 (Commutativity) The state of a fused backup after acting on a sequence of events is independent of the order in which the events are received, as long as the events belong to distinct sets of primaries.

Proof We first prove the theorem for the RCP, which is also a valid fused backup. Let the set of primaries be $\mathcal{P} = \{P_1, \ldots, P_n\}$. Consider an event $e_i$ that belongs to the set of primaries $S_i \subseteq \mathcal{P}$. If the RCP is in state $r$, its next state transition on event $e_i$ depends only on the transition functions of the primaries in $S_i$, and it changes only the components of $r$ that correspond to those primaries. Hence, the state of the RCP after acting on two events $e_a$ and $e_b$ is independent of the order in which these events are received by the RCP, as long as $S_a \cap S_b = \emptyset$. The proof of the theorem follows directly from this.
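The commutativity argument can be illustrated on a toy RCP. This is a minimal sketch, assuming two hypothetical primaries `A` and `B` with made-up events `x`, `y` and `z`; none of these machines appear in the paper. Events `x` and `y` belong to disjoint sets of primaries, while `z` belongs to both.

```python
# Hypothetical primaries: machine[event][state] -> next state.
# A machine simply ignores events that are absent from its table.
A = {"x": {0: 1, 1: 0}, "z": {0: 0, 1: 0}}  # x toggles A, z resets A to 0
B = {"y": {0: 1, 1: 0}, "z": {0: 1, 1: 1}}  # y toggles B, z sets B to 1
PRIMARIES = [A, B]


def owners(event):
    """Indices of the primaries that the event belongs to."""
    return {i for i, machine in enumerate(PRIMARIES) if event in machine}


def rcp_step(state, event):
    """One RCP transition: every primary that owns the event moves."""
    return tuple(
        machine[event][s] if event in machine else s
        for machine, s in zip(PRIMARIES, state)
    )


def run(state, events):
    """Apply a sequence of events to an RCP state."""
    for event in events:
        state = rcp_step(state, event)
    return state
```

Here `run(s, "xy")` and `run(s, "yx")` agree for every RCP state `s`, because `owners("x")` and `owners("y")` are disjoint. The orders `"xz"` and `"zx"` can differ, since both events act on `A`.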

So far, we have presented the framework to understand fault tolerance among machines. Given a set of machines, we can determine if they are a valid set of backups by constructing the fault graph of those machines. In the following section, we present a technique to generate such backups automatically.

4 Algorithm to Generate Fused Backup Machines 

Given a set of $n$ primaries $\mathcal{P}$, we present an algorithm to generate an $(f, f)$-fusion $\mathcal{F}$ of $\mathcal{P}$. The number of faults to be corrected, $f$, is an input parameter based on the system's requirements. The algorithm also takes as input two parameters $\Delta s$ and $\Delta e$ and ensures (if possible) that each machine in $\mathcal{F}$ has at most $(N - \Delta s)$ states and at most $(|\Sigma| - \Delta e)$ events, where $N$ and $|\Sigma|$ are the number of states and events in the RCP. Further, we show that $\mathcal{F}$ is a minimal fusion of $\mathcal{P}$. The algorithm has time complexity polynomial in $N$.

The genFusion algorithm executes $f$ iterations and in each iteration adds a machine to $\mathcal{F}$ that increases $d_{min}(\mathcal{P} \cup \mathcal{F})$ (referred to as $d_{min}$) by one. At the end of $f$ iterations, $d_{min}$ increases to $f + 1$ and hence $\mathcal{P} \cup \mathcal{F}$ can correct $f$ faults. The algorithm ensures that the backup selected in each iteration is optimized for states and events. In the following paragraphs, we explain the genFusion algorithm in detail, followed by an example to illustrate its working.

In each iteration of the genFusion algorithm (Outer Loop), we first identify the set of weakest edges in $\mathcal{P} \cup \mathcal{F}$ and then find a machine that covers these edges, thereby increasing $d_{min}$ by one. We start with the RCP, since it always increases $d_{min}$. The 'State Reduction Loop' and the 'Event Reduction Loop' successively reduce the states and events of the RCP. Finally, the 'Minimality Loop' searches as deep into the closed partition set of the RCP as possible for a reduced state machine, without explicitly constructing the lattice.

State Reduction Loop: This loop uses the reduceState algorithm in Fig. 4 to iteratively generate machines with fewer states than the RCP that increase $d_{min}$ by one. The reduceState algorithm takes as input a machine $P$ and generates a set of machines in which at least two states of $P$ are combined. For each pair of states $s_i, s_j$ in $X_P$, the reduceState algorithm first creates a partition of blocks in which $(s_i, s_j)$ are combined and then constructs the largest machine consistent with this partition. Note that 'largest' is based on the order specified in Section 3.2. This procedure is repeated for all pairs in $X_P$ and the largest incomparable machines among them are returned. At the end of $\Delta s$ iterations of the state reduction loop, we generate a set of machines $\mathcal{M}$, each of which increases $d_{min}$ by one and contains at most $(N - \Delta s)$ states, if such machines exist.
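The pair-combining step of reduceState can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes states are numbered $0 \ldots n-1$, that `delta[s][e]` gives the successor of state `s` on event `e`, and that the "largest machine consistent with a partition" is the finest closed partition in which the chosen pair shares a block. The 4-state counter `COUNTER` is a hypothetical example machine.

```python
from itertools import combinations


def closed_partition(delta, pair):
    """Finest closed partition of states 0..n-1 in which `pair` shares a block.

    Merging two states forces their successors on every event to be merged
    as well, so merges are propagated until nothing changes.
    """
    parent = list(range(len(delta)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    work = [pair]
    while work:
        u, v = work.pop()
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        parent[ru] = rv
        work.extend((delta[u][e], delta[v][e]) for e in delta[u])

    blocks = {}
    for s in range(len(delta)):
        blocks.setdefault(find(s), set()).add(s)
    return frozenset(frozenset(b) for b in blocks.values())


def is_less_or_equal(p, q):
    """Machine p <= machine q: every block of q lies inside a block of p."""
    return all(any(bq <= bp for bp in p) for bq in q)


def reduce_state(delta):
    """Largest incomparable machines obtained by combining one pair of states."""
    candidates = {closed_partition(delta, pair)
                  for pair in combinations(range(len(delta)), 2)}
    return {p for p in candidates
            if not any(q != p and is_less_or_equal(p, q) for q in candidates)}


# Hypothetical machine: a counter modulo 4 with a single event "inc".
COUNTER = [{"inc": (s + 1) % 4} for s in range(4)]
```

For `COUNTER`, combining states 0 and 2 forces 1 and 3 together and yields the two-block machine, while combining 0 and 1 collapses all four states. `reduce_state(COUNTER)` therefore keeps only the two-block machine.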

Event Reduction Loop: Starting with the state reduced machines in $\mathcal{M}$, the event reduction loop uses the reduceEvent algorithm in Fig. 4 to generate reduced event machines that increase $d_{min}$ by one. The reduceEvent algorithm takes as input a machine $P$ and generates a set of machines that contain at least one event less than $\Sigma_P$. To generate a machine less than any given input machine $P$ that does not contain an event $\sigma$ in its event set, the reduceEvent algorithm combines the states such that they loop onto themselves on $\sigma$. The algorithm then constructs the largest machine that contains these states in the combined form. This machine, in effect, ignores $\sigma$. This procedure is repeated for all events in $\Sigma_P$ and the largest incomparable machines among them are returned. At the end of $\Delta e$ iterations of the event reduction loop, we generate a set of machines $\mathcal{M}$, each of which increases $d_{min}$ by one and contains at most $(N - \Delta s)$ states and at most $(|\Sigma| - \Delta e)$ events, if such machines exist.
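The event-removal step of reduceEvent can be sketched in the same style. The code below assumes states numbered $0 \ldots n-1$ and `delta[s][e]` as the successor of `s` on `e`; the function names `close` and `drop_event` and the machine `TWO_BITS` are hypothetical, chosen only for illustration.

```python
def close(delta, pairs):
    """Finest closed partition in which every pair in `pairs` shares a block.
    Returns a list that maps each state to a block representative."""
    parent = list(range(len(delta)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    work = list(pairs)
    while work:
        u, v = work.pop()
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        parent[ru] = rv
        # Merged states must also agree on their successors.
        work.extend((delta[u][e], delta[v][e]) for e in delta[u])
    return [find(s) for s in range(len(delta))]


def drop_event(delta, sigma):
    """Combine every state with its successor on sigma, so that each block
    loops onto itself on sigma, then close the resulting partition."""
    return close(delta, [(s, delta[s][sigma]) for s in range(len(delta))])


def blocks(block_of):
    """Group states by their block representative."""
    groups = {}
    for s, b in enumerate(block_of):
        groups.setdefault(b, set()).add(s)
    return {frozenset(g) for g in groups.values()}


# Hypothetical machine: states are 2-bit numbers, "lo" flips bit 0 and
# "hi" flips bit 1.
TWO_BITS = [{"lo": s ^ 1, "hi": s ^ 2} for s in range(4)]
```

Dropping `"lo"` from `TWO_BITS` gives the blocks $\{0, 1\}$ and $\{2, 3\}$. The reduced machine still reacts to `"hi"` but no longer changes state on `"lo"`.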

Minimality Loop: This loop picks any machine $M$ among the state and event reduced machines in $\mathcal{M}$ and uses the reduceState algorithm iteratively to generate a machine less than $M$ that increases $d_{min}$ by one, until no further state reduction is possible, i.e., all the states of $M$ have been combined. Unlike the state reduction loop (which also uses the reduceState algorithm), in the minimality loop we never exhaustively explore all state reduced machines. After each iteration of the minimality loop, we only pick one machine that increases $d_{min}$ by one.

Note that in all three of these inner loops, if no reduction is achieved in some iteration, then we simply exit the loop with the machines generated in the previous iteration. We use the example in Fig. 2 with $\mathcal{P} = \{A, B, C\}$, $f = 2$, $\Delta s = 1$ and $\Delta e = 1$ to explain the genFusion algorithm. Since $f = 2$, there are two iterations of the outer loop and in each iteration we generate one machine. Consider the first iteration of the outer loop. Initially, $\mathcal{F}$ is empty and we need to add a machine that covers the weakest edges in $G(\{A, B, C\})$.

To identify the weakest edges, we need to identify the mapping between the states of the RCP and the states of the primaries. For example, in Fig. 2 we need to map the states of the RCP to $A$. The starting states are always mapped to each other and hence $r^0$ is mapped to $a^0$. Now $r^0$ on event 0 transitions to $r^2$, while $a^0$ on event 0 transitions to $a^1$. Hence, $r^2$ is mapped to $a^1$. Continuing this procedure for all states and events, we obtain the mapping shown, i.e., $a^0 = \{r^0, r^1, r^5, r^6\}$ and $a^1 = \{r^2, r^3, r^4, r^7\}$. Following this procedure for all primaries, we can identify the weakest edges in $G(\{A, B, C\})$ (Fig. 3(ii)). In Fig. 2, $M_1$, $M_2$ and $F_2$ are some of the largest incomparable machines that contain at least one state less than the RCP (the entire set is too large to be enumerated here). All three of these machines increase $d_{min}$ and at the end of the one and only iteration of the state reduction loop, $\mathcal{M}$ will contain at least these three machines.
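The state-mapping procedure described above amounts to a traversal of the RCP from its start state. The sketch below assumes transition tables of the form `delta[state][event] -> next state`; the four-state RCP and the primary `A` used here are hypothetical and much smaller than the machines of Fig. 2.

```python
from collections import deque


def map_rcp_to_primary(rcp_delta, rcp_start, prim_delta, prim_start):
    """Map every reachable RCP state to the primary state it corresponds to."""
    mapping = {rcp_start: prim_start}  # start states map to each other
    queue = deque([rcp_start])
    while queue:
        r = queue.popleft()
        p = mapping[r]
        for event, r_next in rcp_delta[r].items():
            if r_next not in mapping:
                # The primary stays put on events outside its event set.
                mapping[r_next] = prim_delta[p].get(event, p)
                queue.append(r_next)
    return mapping


# Hypothetical RCP of two toggling primaries: event 0 toggles A, event 1
# toggles a second primary that is not shown.
RCP = {
    "r0": {0: "r2", 1: "r1"},
    "r1": {0: "r3", 1: "r0"},
    "r2": {0: "r0", 1: "r3"},
    "r3": {0: "r1", 1: "r2"},
}
A = {"a0": {0: "a1"}, "a1": {0: "a0"}}
```

With these tables, `map_rcp_to_primary(RCP, "r0", A, "a0")` assigns `r0` and `r1` to `a0`, and `r2` and `r3` to `a1`. Grouping RCP states by their image gives the blocks of the primary.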

The event reduction loop tries to find machines with fewer events than the machines in $\mathcal{M}$. For example, to generate a machine less than $M_2$ that does not contain, say, event 2, the reduceEvent algorithm combines the blocks of $M_2$ such that they do not transition on event 2. Hence, $\{r^0, r^2\}$ in $M_2$ is combined with $\{r^4, r^5\}$ and $\{r^1, r^3\}$ is combined with $\{r^6, r^7\}$

In Appendix A, we present the concept of the event-based decomposition of machines to replace a given machine $A$ with a set of machines that contain fewer events than $\Sigma_A$.
genFusion
Input: Primaries $\mathcal{P}$, faults $f$, state-reduction parameter $\Delta s$, event-reduction parameter $\Delta e$;
Output: $(f, f)$-fusion of $\mathcal{P}$;
$\mathcal{F} \leftarrow \{\}$;
//Outer Loop
for ($i = 1$ to $f$)
    Identify weakest edges in fault graph $G(\mathcal{P} \cup \mathcal{F})$;
    $\mathcal{M} \leftarrow \{RCP(\mathcal{P})\}$;
    //State Reduction Loop
    for ($j = 1$ to $\Delta s$)
        $\mathcal{S} \leftarrow \{\}$;
        for ($M \in \mathcal{M}$)
            $\mathcal{S} = \mathcal{S} \cup$ reduceState($M$);
        $\mathcal{M}$ = All machines in $\mathcal{S}$ that increment $d_{min}(\mathcal{P} \cup \mathcal{F})$;
    //Event Reduction Loop
    for ($j = 1$ to $\Delta e$)
        $\mathcal{S} \leftarrow \{\}$;
        for ($M \in \mathcal{M}$)
            $\mathcal{S} = \mathcal{S} \cup$ reduceEvent($M$);
        $\mathcal{M}$ = All machines in $\mathcal{S}$ that increment $d_{min}(\mathcal{P} \cup \mathcal{F})$;
    //Minimality Loop
    $M \leftarrow$ Any machine in $\mathcal{M}$;
    while (all states of $M$ have not been combined)
        $\mathcal{C} \leftarrow$ reduceState($M$);
        $M$ = Any machine in $\mathcal{C}$ that increments $d_{min}(\mathcal{P} \cup \mathcal{F})$;
    $\mathcal{F} \leftarrow \mathcal{F} \cup \{M\}$;
return $\mathcal{F}$;

reduceState
Input: Machine $P$ with state set $X_P$, event set $\Sigma_P$ and transition function $\alpha_P$;
Output: Largest machines $< P$ with $\le |X_P| - 1$ states;
$\mathcal{S} = \{\}$;
for ($s_i, s_j \in X_P$)
    //combine states $s_i$ and $s_j$
    Set of states, $X_B = X_P$ with $(s_i, s_j)$ combined;
    $\mathcal{S} = \mathcal{S} \cup$ {Largest machine consistent with $X_B$};
return largest incomparable machines in $\mathcal{S}$;

reduceEvent
Input: Machine $P$ with state set $X_P$, event set $\Sigma_P$ and transition function $\alpha_P$;
Output: Largest machines $< P$ with $\le |\Sigma_P| - 1$ events;
$\mathcal{S} = \{\}$;
for ($\sigma \in \Sigma_P$)
    Set of states, $X_B = X_P$;
    //combine states to self-loop on $\sigma$
    for ($s \in X_B$)
        $s = s \cup \alpha_P(s, \sigma)$;
    $\mathcal{S} = \mathcal{S} \cup$ {Largest machine consistent with $X_B$};
return largest incomparable machines in $\mathcal{S}$;

Fig. 4 Algorithm to generate an $(f, f)$-fusion for a given set of primaries $\mathcal{P}$. Note that we use the terms largest and incomparable w.r.t. the order defined in Section 3.2.



to generate machine $F_1$ that does not act on event 2. The only machine less than $M_2$ that does not act on event 1 is $R_\perp$. Since the reduceEvent algorithm returns the largest incomparable machines, only $F_1$ is returned when $M_2$ is the input. Similarly, with $M_1$ as input, the reduceEvent algorithm returns $\{C, F_1\}$ and with $F_2$ as input it returns $R_\perp$. Among these machines only $F_1$ increases $d_{min}$. For example, $C$ does not cover the weakest edge $(r^0, r^1)$ of $G(\mathcal{P})$. Hence, at the end of the one and only iteration of the event reduction loop, $\mathcal{M} = \{F_1\}$.

As there exists no machine less than $F_1$ that increases $d_{min}$, at the end of the minimality loop, $M = F_1$. Similarly, in the second iteration of the outer loop $M = F_2$, and the genFusion algorithm returns $\{F_1, F_2\}$ as the fusion machines that increase $d_{min}$ to three. Hence, using the genFusion algorithm, we have automatically generated the backups $F_1$ and $F_2$ shown in Fig. 2. Note that in the worst case there may exist no efficient backups, and the genFusion algorithm will just return a set of $f$ copies of the RCP. However, our results in Section 7 indicate that for many examples, efficient backups do exist.



4.1 Properties of the genFusion Algorithm

In this section, we prove properties of the genFusion algorithm with respect to: (i) the number of fusion/backup machines, (ii) the number of states in each fusion machine, (iii) the number of events in each fusion machine and (iv) the minimality of the set of fusion machines $\mathcal{F}$. We first introduce concepts that are relevant to the proof of these properties.

Lemma 1 Given a set of primary machines $\mathcal{P}$, $d_{min}(\mathcal{P}) = 1$.

Proof Given the state of all the primary machines, the state of the RCP can be uniquely determined. Hence, there is at least one machine among the primaries that distinguishes between each pair of states in the RCP and so $d_{min}(\mathcal{P}) \ge 1$. In Section 2, we assume that the set of machines in $\mathcal{P}$ cannot correct a single fault and this implies that $d_{min}(\mathcal{P}) \le 1$. Hence, $d_{min}(\mathcal{P}) = 1$.

Lemma 2 Given a set of primary machines $\mathcal{P}$, let $\mathcal{F}'$ be an $(f, f)$-fusion of $\mathcal{P}$. Each fusion machine $F \in \mathcal{F}'$ has to cover the weakest edges in $G(\mathcal{P})$.






Bharath Balasubramanian, Vijay K. Garg 



Proof From Lemma 1, the weakest edges of $G(\mathcal{P})$ have weight equal to one. Since $\mathcal{F}'$ is an $(f, f)$-fusion of $\mathcal{P}$, $d_{min}(\mathcal{P} \cup \mathcal{F}') > f$. Also, each machine in $\mathcal{F}'$ can increase the weight of any edge by at most one. Hence, all the $f$ machines in $\mathcal{F}'$ have to cover the weakest edges in $G(\mathcal{P})$.

Let the weakest edges of $G(\mathcal{P} \cup \mathcal{F})$ at the start of the $i^{th}$ iteration of the outer loop of the genFusion algorithm be denoted $E_i$. In the following lemma, we show that the set of weakest edges only increases with each iteration.

Lemma 3 In the genFusion algorithm, for any two iterations $i$ and $j$, if $i < j$, then $E_i \subseteq E_j$.

Proof Let the value of $d_{min}$ for the $i^{th}$ iteration be $d$ and the edges with this weight be $E_i$. Any machine added to $\mathcal{F}$ can increase the weight of each edge by at most one, and it has to increase the weight of all the edges in $E_i$ by one. So, $d_{min}$ for the $(i + 1)^{th}$ iteration is $d + 1$ and the weight of the edges in $E_i$ will increase to $d + 1$. Hence, $E_i$ will be among the weakest edges in the $(i + 1)^{th}$ iteration, or in other words, $E_i \subseteq E_{i+1}$. This trivially extends to the result: for any two iterations numbered $i$ and $j$ of the genFusion algorithm, if $i < j$, then $E_i \subseteq E_j$.

We now prove one of the main theorems of this paper. 

Theorem 6 (Fusion Algorithm) Given a set of $n$ machines $\mathcal{P}$, the genFusion algorithm generates a set of machines $\mathcal{F}$ such that:

1. (Correctness) $\mathcal{F}$ is an $(f, f)$-fusion of $\mathcal{P}$.

2. (State & Event Efficiency) If each machine in $\mathcal{F}$ has greater than $(N - \Delta s)$ states and $(|\Sigma| - \Delta e)$ events, then no $(f, f)$-fusion of $\mathcal{P}$ contains a machine with less than or equal to $(N - \Delta s)$ states and $(|\Sigma| - \Delta e)$ events.

3. (Minimality) $\mathcal{F}$ is a minimal $(f, f)$-fusion of $\mathcal{P}$.

Proof 1. From Lemma 1, $d_{min}(\mathcal{P}) = 1$. Starting with the RCP, which always increases $d_{min}$ by one, we add one machine in each iteration to $\mathcal{F}$ that increases $d_{min}(\mathcal{P} \cup \mathcal{F})$ by one. Hence, at the end of $f$ iterations of the genFusion algorithm, we add exactly $f$ machines to $\mathcal{F}$ that increase $d_{min}$ to $f + 1$. Hence, $\mathcal{F}$ is an $(f, f)$-fusion of $\mathcal{P}$.

2. Assume that each machine in $\mathcal{F}$ has greater than $(N - \Delta s)$ states and $(|\Sigma| - \Delta e)$ events. Let there be another $(f, f)$-fusion of $\mathcal{P}$ that contains a machine $F'$ with less than or equal to $(N - \Delta s)$ states and $(|\Sigma| - \Delta e)$ events. From Lemma 2, $F'$ covers the weakest edges in $G(\mathcal{P})$. However, in the first iteration of the outer loop, the genFusion algorithm searches exhaustively for a fusion with less than or equal to $(N - \Delta s)$ states and $(|\Sigma| - \Delta e)$ events that covers the weakest edges in $G(\mathcal{P})$. Hence, if such a machine $F'$ existed, then the algorithm would have chosen it.

3. Let there be an $(f, f)$-fusion $\mathcal{G} = \{G_1, \ldots, G_f\}$ of $\mathcal{P}$, such that $\mathcal{G}$ is less than the $(f, f)$-fusion $\mathcal{F} = \{F_1, F_2, \ldots, F_f\}$. Hence $\forall j : G_j \le F_j$. Let $G_i < F_i$ and let $E_i$ be the set of edges that needed to be covered by $F_i$. It follows from the genFusion algorithm that $G_i$ does not cover at least one edge, say $e$, in $E_i$ (otherwise the algorithm would have returned $G_i$ instead of $F_i$). From Lemma 3, it follows that if $e$ is covered by $k$ machines in $\mathcal{F}$, then $e$ has to be covered by $k$ machines in $\mathcal{G}$. We know that there is a pair of machines $F_i$, $G_i$ such that $F_i$ covers $e$ and $G_i$ does not cover $e$. For all other pairs $F_j$, $G_j$, if $G_j$ covers $e$ then $F_j$ covers $e$ (since $G_j \le F_j$). Hence, $e$ can be covered by no more than $k - 1$ machines in $\mathcal{G}$. This implies that $\mathcal{G}$ is not an $(f, f)$-fusion.

4.2 Time Complexity of the genFusion Algorithm 

The time complexity of the genFusion algorithm is the sum of the time complexities of the inner loops multiplied by the number of iterations, $f$. We analyze the time complexity of each of the inner loops. Let the set of machines in $\mathcal{M}$ at the start of the $i^{th}$ iteration of the outer loop be denoted $\mathcal{M}_i$.

State Reduction Loop: The time complexity of the state reduction loop for the $i^{th}$ iteration of the outer loop is $T_1 + T_2$, where $T_1$ is the time complexity to reduce the states of the machines in $\mathcal{M}_i$ and $T_2$ is the time complexity to find the machines among $\mathcal{S}$ that increment $d_{min}$. First, let us consider $T_1$. Note that initially $\mathcal{M}$, i.e., $\mathcal{M}_1$, contains only the RCP with $O(N)$ states, and for any iteration of the state reduction loop, each of the machines in $\mathcal{M}_i$ has $O(N)$ states. Given a machine $M$ with $O(N)$ states, the reduceState algorithm generates machines with fewer states than $M$. For each pair of states in $M$, the time complexity to generate the largest closed partition that contains these states in a combined block is just $O(N|\Sigma|)$. Since there are $O(N^2)$ pairs of states in $M$, the time complexity of the reduceState algorithm is $O(N^3|\Sigma|)$. Hence, $T_1 = O(|\mathcal{M}_i| N^3 |\Sigma|)$.

Now, we consider $T_2$. Since there are $O(N^2)$ pairs of states in each machine in $\mathcal{M}_i$, the reduceState algorithm returns $O(N^2)$ machines. So, $|\mathcal{S}| = O(N^2 |\mathcal{M}_i|)$. Since there are $O(N^2)$ edges in the fault graph $G(\mathcal{P} \cup \mathcal{F})$, given any machine in $\mathcal{S}$, the time complexity to check if it increments $d_{min}$ is $O(N^2)$. Hence, $T_2 = O(|\mathcal{S}| N^2) = O(N^4 |\mathcal{M}_i|)$. So, the time complexity of each iteration of the state reduction loop is

$T_1 + T_2 = O(|\mathcal{M}_i| N^3 |\Sigma| + N^4 |\mathcal{M}_i|)$.

Since the reduceState algorithm generates $O(N^2)$ machines per machine in $\mathcal{M}_i$, $|\mathcal{M}_{i+1}| = N^2 |\mathcal{M}_i|$. In the first iteration $\mathcal{M}$ just contains the RCP and $|\mathcal{M}_1| = 1$. Hence, the time complexity of the state reduction loop is

$O\big((N^3|\Sigma| + N^4)(1 + N^2 + N^4 + \cdots + N^{2(\Delta s - 1)})\big) = O\big((N^3|\Sigma| + N^4)\,\tfrac{N^{2\Delta s} - 1}{N^2 - 1}\big)$

(the series is a geometric progression). This reduces to $O(N^{2\Delta s + 1}|\Sigma| + N^{2\Delta s + 2})$. Also, $\mathcal{M}$ contains $O(N^{2\Delta s})$ machines at the end of the state reduction loop.

Event Reduction Loop: The time complexity analysis for the event reduction loop is similar, except for the fact that the reduceEvent algorithm iterates through the $|\Sigma|$ events of each machine in $\mathcal{M}$ and returns $O(|\Sigma|)$ machines per machine in $\mathcal{M}$. Also, while the state reduction loop starts with just one machine in $\mathcal{M}$, the event reduction loop starts with $O(N^{2\Delta s})$ machines in $\mathcal{M}$. Hence, the time complexity of the event reduction loop is

$O\big((N|\Sigma|^2 + N^2|\Sigma|)(N^{2\Delta s})(1 + |\Sigma| + |\Sigma|^2 + \cdots + |\Sigma|^{\Delta e - 1})\big) = O\big((N|\Sigma|^2 + N^2|\Sigma|)(N^{2\Delta s})\,\tfrac{|\Sigma|^{\Delta e} - 1}{|\Sigma| - 1}\big) = O(N^{2\Delta s + 1}|\Sigma|^{\Delta e + 1} + N^{2\Delta s + 2}|\Sigma|^{\Delta e})$.

Minimality Loop: In the minimality loop, we use the reduceState algorithm, but only select one machine per iteration. Also, in each iteration of the minimality loop, the number of states in $M$ is at least one less than the number of states in $M$ for the previous iteration. Hence, the minimality loop executes $O(N)$ iterations with total time complexity $O((N^3|\Sigma| + N^4)(N)) = O(N^4|\Sigma| + N^5)$.

Since there are $f$ iterations of the outer loop, the time complexity of the genFusion algorithm is

$O(fN^{2\Delta s + 1}|\Sigma| + fN^{2\Delta s + 2} + fN^{2\Delta s + 1}|\Sigma|^{\Delta e + 1} + fN^{2\Delta s + 2}|\Sigma|^{\Delta e} + fN^4|\Sigma| + fN^5)$.

This reduces to

$O(fN^{2\Delta s + 1}|\Sigma|^{\Delta e + 1} + fN^{2\Delta s + 2}|\Sigma|^{\Delta e} + fN^4|\Sigma| + fN^5)$.

Observation 1 For parameters $\Delta s = 0$ and $\Delta e = 0$, the genFusion algorithm generates a minimal $(f, f)$-fusion of $\mathcal{P}$ with time complexity $O(fN^4|\Sigma| + fN^5)$, i.e., the time complexity is polynomial in the number of states of the RCP.

If there are $n$ primaries each with $O(s)$ states, then $N$ is $O(s^n)$. Hence, the time complexity of the genFusion algorithm reduces to $O(s^n|\Sigma|f)$. Even though the time complexity of generating the fusions is exponential in $n$, note that the fusions have to be generated only once. Further, in Appendix B, we present an incremental approach for the generation of fusions that improves the time complexity by a factor of $O(p^n)$ for constant values of $p$, where $p$ is the average state reduction achieved by fusion, i.e., (Number of states in the RCP / Average number of states in each fusion machine).

5 Detection and Correction of Faults 

In this section, we provide algorithms to detect Byzantine faults with time complexity $O(nf)$, on average, and correct crash/Byzantine faults with time complexity $O(npf)$, with high probability, where $n$ is the number of primaries, $f$ is the number of crash faults and $p$ is the average state reduction achieved by fusion. Throughout this section, we refer to Fig. 2 with primaries $\mathcal{P} = \{A, B, C\}$ and backups $\mathcal{F} = \{F_1, F_2\}$ that can correct two crash faults. The execution state of the primaries is represented collectively as an $n$-tuple (referred to as the primary tuple), while the state of each backup/fusion is represented as the set of primary tuples it corresponds to (referred to as the tuple-set). In Fig. 2, if $A$, $B$, $C$ and $F_1$ are in their initial states, then the primary tuple is $a^0b^0c^0$ and the state of $F_1$ is $f_1^0 = \{a^0b^0c^0, a^1b^0c^1, a^1b^1c^0, a^0b^1c^1\}$ (which corresponds to $\{r^0, r^2, r^4, r^5\}$).

5.1 Detection of Byzantine Faults 

Given the primary tuple and the tuple-sets corresponding to the fusion states, the detectByz algorithm in Fig. 5 detects up to $f$ Byzantine faults (liars). Assuming that the tuple-set of each fusion state is stored in a permanent hash table at the recovery agent, the detectByz algorithm simply checks if the primary tuple $r$ is present in each backup tuple-set $b$. In Fig. 2, if the states of machines $A$, $B$, $C$, $F_1$ and $F_2$ are $a^1$, $b^1$, $c^0$, $f_1^1$ and $f_2^1$ respectively, then the algorithm flags a Byzantine fault, since $a^1b^1c^0$ is not present in either $f_1^1 = \{a^0b^1c^0, a^1b^1c^1, a^0b^0c^1, a^1b^0c^0\}$ or $f_2^1 = \{a^0b^1c^0, a^1b^0c^1\}$.
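The membership check performed by detectByz can be sketched with Python sets standing in for the hash tables at the recovery agent. The tuple-sets below are the ones quoted in the text for $f_1^0$, $f_1^1$, $f_2^0$ and $f_2^1$; the encoding of a tuple such as $a^1b^1c^0$ as `(1, 1, 0)` and the function name `detect_byz` are assumptions made for this illustration.

```python
# Tuple-sets quoted in the text for Fig. 2. A tuple (a, b, c) lists the
# superscripts of the primary states, e.g. (1, 1, 0) stands for a^1 b^1 c^0.
F1 = {
    0: {(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)},
    1: {(0, 1, 0), (1, 1, 1), (0, 0, 1), (1, 0, 0)},
}
F2 = {
    0: {(0, 0, 0), (1, 1, 1)},
    1: {(0, 1, 0), (1, 0, 1)},
}


def detect_byz(fusion_states, r):
    """Return True if some reported fusion state does not contain tuple r."""
    return any(r not in b for b in fusion_states)
```

With the reported states of the example, `detect_byz([F1[1], F2[1]], (1, 1, 0))` flags a fault, whereas `detect_byz([F1[0], F2[0]], (0, 0, 0))` does not.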

To show that $r$ is not present in at least one of the backup tuple-sets in $B$ when there are liars, we make two observations. First, we are only concerned about machines that lie within their state set. For example, in Fig. 2, suppose the true state of $F_2$ is $f_2^0$. To lie, if $F_2$ says its state is any number apart from $f_2^1$, $f_2^2$ and $f_2^3$, then that can be detected easily.

Second, like the fusion states, each primary state can be expressed as a tuple-set that contains the RCP states it belongs to. Regardless of whether $r$ is correct or incorrect (with liars), it will be present in all the truthful primary states. For example, in Fig. 2, if the correct primary tuple is $a^0b^0c^0$, then $a^0 = \{a^0b^0c^0, a^0b^1c^0, a^0b^1c^1, a^0b^0c^1\}$ contains $a^0b^0c^0$. If $B$ lies, then the primary tuple will be $a^0b^1c^0$, which is incorrect. Clearly, $a^0$ contains this incorrect primary tuple as well.

Theorem 7 Given a set of $n$ machines $\mathcal{P}$ and an $(f, f)$-fusion $\mathcal{F}$ corresponding to it, the detectByz algorithm detects up to $f$ Byzantine faults among them.

Proof Let $r$ be the correct primary tuple. Each primary tuple is present in exactly one fusion state (the fusion states partition the RCP states), i.e., the correct fusion state. Hence, the incorrect fusion states (liars) will not contain $r$ and the fault will be detected. If $r$ is incorrect (with liars), then for the fault to go undetected, $r$ must be present in all the fusion states.

If $r^c$ is the correct primary tuple, then the truthful fusion states have to contain $r^c$ as well, which implies that they contain $\{r, r^c\}$ in the same tuple-set. As observed above, the truthful primaries will also contain $\{r, r^c\}$ in the same tuple-set. So the execution states of all the truthful machines contain $\{r, r^c\}$ in the same tuple-set. Hence, less than or equal to $f$ machines, i.e., the liars, can contain $r$ and $r^c$ in distinct tuple-sets. This contradicts the fact that $\mathcal{F}$ is an $(f, f)$-fusion with greater than $f$ machines separating each pair of RCP states.

We consider the space complexity for maintaining the 
hash tables at the recovery agent. Note that, the space com- 
plexity to maintain a hash table is simply the number of points 
detectByz
Input: set of $f$ fusion states $B$, primary tuple $r$;
Output: true if there is a Byzantine fault and false if not;
for ($b \in B$)
    if $\neg$(hash_table($b$).contains($r$))
        return true;
return false;

correctCrash
Input: set of available fusion states $B$, primary tuple $r$, faults among primaries $t$;
Output: corrected primary $n$-tuple;
$D \leftarrow \{\}$ //list of tuple-sets
//find tuples in $b$ within Hamming distance $t$ of $r$
for ($b \in B$)
    $S \leftarrow$ lsh_tables($b$).search($r$, $t$);
    $D$.add($S$);
return Intersection of sets in $D$;

correctByz
Input: set of $f$ fusion states $B$, primary tuple $r$;
Output: corrected primary $n$-tuple;
$D \leftarrow \{\}$ //list of tuple-sets
//find tuples in $b$ within Hamming distance $\lfloor f/2 \rfloor$ of $r$
for ($b \in B$)
    $S \leftarrow$ lsh_tables($b$).search($r$, $\lfloor f/2 \rfloor$);
    $D$.add($S$);
$G \leftarrow$ Set of tuples that appear in $D$;
$V \leftarrow$ Vote array of size $|G|$;
for ($g \in G$)
    //get votes from fusions
    $V[g] \leftarrow$ Number of times $g$ appears in $D$;
    //get votes from primaries
    for ($i = 1$ to $n$)
        if ($r[i] \in g$)
            $V[g]$++;
return Tuple $g$ such that $V[g] \ge n + \lfloor f/2 \rfloor$;

Fig. 5 Detection and correction of faults.



in the hash table multiplied by the size of each point. In our solution, we hash the tuples belonging to the fusion states. In each fusion machine, there are $N$ such tuples, since the fusion states partition the states of the RCP. Each tuple contains $n$ primary states, each of size $\log s$, where $s$ is the maximum number of states in any primary. For example, $a^0b^1c^0$ contains three primary states ($n = 3$), and since there are two states in $A$ ($s = 2$) we need just one bit to represent it. Since there are $f$ fusion machines, we hash a total of $Nf$ points, each of size $O(n \log s)$. Hence, the space complexity at the recovery agent is $O(Nfn \log s)$.

Since each fusion state is maintained as a hash table, it will take $O(n)$ time (on average) to check if a primary tuple with $n$ primary states is present in the fusion state. Since there are $f$ fusion states, the time complexity for the detectByz algorithm is $O(nf)$ on average. Even for replication, the recovery agent needs to compare the state of $n$ primaries with the state of each of its $f$ copies, with time complexity $O(nf)$. In terms of message complexity, in fusion, we need to acquire the state of $n + f$ machines to detect the faults, while for replication, we need to acquire the state of $2nf$ machines.



5.2 Correction of Faults 

Given a primary tuple $r$ and the tuple-set of a fusion state, say $b$, consider the problem of finding the tuples in $b$ that are within Hamming distance $f$ of $r$. This is the key concept that we use for the correction of faults, as explained in Sections 5.2.1 and 5.2.2. In Fig. 2, the tuples in $f_1^0 = \{a^0b^0c^0, a^1b^0c^1, a^1b^1c^0, a^0b^1c^1\}$ that are within Hamming distance one of a primary tuple $a^0b^0c^1$ are $a^0b^0c^0$, $a^1b^0c^1$ and $a^0b^1c^1$. An efficient solution to finding the points among a
large set within a certain Hamming distance of a query point 
is locality sensitive hashing (LSH) lfTlfT2ll . Based on this, we 
first select $L$ hash functions $\{g_1 \ldots g_L\}$ and for each $g_j$ we associate an ordered set (increasing order) of $k$ numbers $C_j$ picked uniformly at random from $\{0 \ldots n - 1\}$. The hash function $g_j$ takes as input an $n$-tuple, selects the coordinates from it as specified by the numbers in $C_j$ and returns the concatenated bit representation of these coordinates. At the recovery agent, for each fusion state we maintain $L$ hash tables, with the functions selected above, and hash each tuple in the fusion state. In Fig. 6(i), $g_1$ and $g_2$ are associated with the sets $C_1 = \{0, 1\}$ and $C_2 = \{0, 2\}$ respectively. Hence, the tuple $a^1b^0c^1$ of $f_1^0$ is hashed into the 2nd bucket of $g_1$ and the 3rd bucket of $g_2$.



Given a primary tuple $r$ and a fusion state $b$, to find the tuples among $b$ that are within a Hamming distance $f$ of $r$, we obtain the points found in the buckets $g_j(r)$ for $j = 1 \ldots L$ maintained for $b$ and return those that are within distance $f$ from $r$. In Fig. 6(i), let $r = a^0b^1c^0$, $f = 2$, and $b = f_1^0$. The primary tuple $r$ hashes into the 1st bucket of $g_1$ and the 0th bucket of $g_2$, which contain the points $a^0b^1c^1$ and $a^0b^0c^0$ respectively. Since both of them are within Hamming distance two of $r$, both the points are returned. If we set $L = \log_{1 - \gamma^k} \delta$, where $\gamma = 1 - f/n$, such that $(1 - \gamma^k)^L \le \delta$, then any $f$-neighbor of a point $q$ is returned with probability at least $1 - \delta$
fl TCZll . In the following sections, we present algorithms for 
the correction of crash and Byzantine faults based on these 
LSH functions. 
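The bucket lookup described above can be sketched as follows. The class below is a simplified illustration, not the authors' implementation: it takes the coordinate sets as fixed inputs instead of drawing them at random, encodes tuples such as $a^1b^0c^1$ as `(1, 0, 1)`, and uses the hypothetical names `LshTables` and `hamming`.

```python
def hamming(u, v):
    """Number of coordinates in which two tuples differ."""
    return sum(x != y for x, y in zip(u, v))


class LshTables:
    """One hash table per coordinate set C_j, built over a tuple-set."""

    def __init__(self, tuples, coord_sets):
        self.coord_sets = coord_sets
        self.tables = []
        for coords in coord_sets:
            table = {}
            for t in tuples:
                key = tuple(t[c] for c in coords)  # g_j(t)
                table.setdefault(key, set()).add(t)
            self.tables.append(table)

    def search(self, r, dist):
        """Tuples in the buckets g_j(r) that lie within `dist` of r."""
        found = set()
        for coords, table in zip(self.coord_sets, self.tables):
            key = tuple(r[c] for c in coords)
            for t in table.get(key, ()):
                if hamming(t, r) <= dist:
                    found.add(t)
        return found


# Fusion state f_1^0 and the coordinate sets C_1 = {0, 1}, C_2 = {0, 2}.
F1_0 = {(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)}
tables = LshTables(F1_0, [(0, 1), (0, 2)])
```

For $r = a^0b^1c^0$ and distance two, `tables.search((0, 1, 0), 2)` returns $a^0b^1c^1$ and $a^0b^0c^0$, matching the worked example. With these particular coordinate sets the neighbor $a^1b^1c^0$ falls into other buckets and is not returned, which illustrates why the guarantee is probabilistic and depends on $k$ and $L$.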



Fault Tolerance in Distributed Systems using Fused State Machines 






[Figure: LSH hash tables $g_1$ (coordinates 0 and 1) and $g_2$ (coordinates 0 and 2), each with buckets 0 to 3, shown for (i) fusion state $f_1^0$ and (ii) fusion state $f_2^1$.]

Fig. 6 LSH example for fusion states in Fig. 2 with $k = 2$, $L = 2$.



5.2.1 Crash Correction

Given the primary tuple (with possible gaps due to faults) and the tuple-sets of the available fusion states, the correctCrash algorithm in Fig. 5 corrects up to $f$ crash faults. The algorithm finds the set of tuples $S$ in each fusion state $b$, where each tuple belonging to $S$ is within a Hamming distance $t$ of the primary tuple $r$. Here, $t$ is the number of faults among the primaries. To do this efficiently, we use the LSH tables of each fusion state. The set $S$ returned for each fusion state is stored in a list $D$. If the intersection of the sets in $D$ is a singleton, then we return that as the correct primary tuple. If the intersection is empty, we need to exhaustively search each fusion state for points within distance $t$ of $r$ (LSH has not returned all of them), but this happens with a very low
probability lfT1[T2l. 

In Fig. 2, assume crash faults in $B$ and $C$. Given the states of $A$, $F_1$ and $F_2$ as $a^0$, $f_1^0$ and $f_2^0$ respectively, the tuples within Hamming distance two of $r = a^0\{empty\}\{empty\}$ among states $f_1^0 = \{a^0b^0c^0, a^1b^0c^1, a^1b^1c^0, a^0b^1c^1\}$ and $f_2^0 = \{a^0b^0c^0, a^1b^1c^1\}$ are $\{a^0b^0c^0, a^0b^1c^1\}$ and $\{a^0b^0c^0\}$ respectively. The algorithm returns their intersection, $a^0b^0c^0$, as the corrected primary tuple. In the following theorem, we prove that the correctCrash algorithm returns a unique primary tuple.
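The crash-correction example can be reproduced with a short sketch. To keep it small, the code scans each tuple-set exhaustively instead of using LSH tables, marks a crashed primary with `None`, and counts every gap as a mismatch when computing the Hamming distance; these choices and the name `correct_crash` are assumptions made for illustration.

```python
def hamming(r, t):
    """Distance between primary tuple r (None marks a gap) and tuple t.
    Every gap counts as one mismatch."""
    return sum(1 for x, y in zip(r, t) if x is None or x != y)


def correct_crash(fusion_states, r):
    """Intersect, over the available fusion states, the tuples that lie
    within distance t of r, where t is the number of gaps in r."""
    t = sum(x is None for x in r)
    candidates = [{u for u in b if hamming(r, u) <= t} for b in fusion_states]
    common = set.intersection(*candidates)
    if len(common) != 1:
        raise ValueError("primary tuple cannot be uniquely recovered")
    return next(iter(common))


# Fusion states f_1^0 and f_2^0 from the example; B and C have crashed.
F1_0 = {(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)}
F2_0 = {(0, 0, 0), (1, 1, 1)}
```

Calling `correct_crash([F1_0, F2_0], (0, None, None))` yields `(0, 0, 0)`, i.e., $a^0b^0c^0$. With only `F1_0` available, two candidates remain and the sketch raises an error, which mirrors the need for both backups to correct two crash faults.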

Theorem 8 Given a set of $n$ machines $\mathcal{P}$ and an $(f, f)$-fusion $\mathcal{F}$ corresponding to it, the correctCrash algorithm corrects up to $f$ crash faults among them.

Proof Since there are $t$ gaps due to $t$ faults in the primary tuple $r$, the tuples among the backup tuple-sets within a Hamming distance $t$ of $r$ are the tuples that contain $r$ (definition of Hamming distance). Let us assume that the intersection of the tuple-sets among the fusion states containing $r$ is not a singleton. Hence, all the available fusion states have at least two RCP states, $\{r^i, r^j\}$, that contain $r$. Similar to the proof in Theorem 7, since both $r^i$ and $r^j$ contain $r$, these states will be present in the same tuple-sets of all the available primaries as well. Hence, less than or equal to $f$ machines, i.e., the failed machines, can contain $r^i$ and $r^j$ in distinct tuple-sets. This contradicts the fact that $\mathcal{F}$ is an $(f, f)$-fusion with greater than $f$ machines separating each pair of RCP states.

The space complexity analysis is similar to that for Byzantine detection, since we maintain hash tables for each fusion state and hash all the tuples belonging to them. Assuming $L$ is a constant, the space complexity of storage at the recovery agent is $O(Nfn \log s)$.

Let p be the average state reduction achieved by our fusion- 
based technique. Each fusion machine partitions the states of 
the RCP and the average size of each fusion machine is N/p. 
Hence, the number of tuples (or points) in each fusion state 
is p. This implies that there can be 0(p) tuples in each fusion 
state that are within distance / of r. So, the cost of hashing 
r and retrieving 0(p) n-dimensional points from 0(f) fusion 
states in B is O(npf) w.h.p (assuming k, L for the LSH tables 
are constants). So, the cost of generating D is 0(npf) w.h.p. 
Also, the number of tuple sets in D is 0(pf). 

In order to find the intersection of the tuple-sets in D in linear
time, we can hash the elements of the smallest tuple-set and check if
the elements of the other tuple-sets are part of this set. The time
complexity to find the intersection among the O(pf) points in D, each
of size n, is simply O(npf). Hence, the overall time complexity of the
correctCrash algorithm is O(npf) w.h.p. Crash correction in replication
involves copying the state of the copies of the f failed primaries,
which has time complexity Θ(f). In terms of message complexity, in
fusion, we need to acquire the state of all n machines that remain
after f faults. In replication, we just need to acquire the copies of
the f failed primaries.
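The crash-correction steps analysed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: a brute-force scan stands in for the LSH lookup, the function and variable names are assumptions, and the scenario (A and B crashed, C in state 0, F1 and F2 in the fusion states whose tuple-sets are quoted in the example of Section 5.2.2) is hypothetical. Tuples are written as index triples, so (0, 1, 0) stands for the tuple with A in state 0, B in state 1 and C in state 0.

```python
from functools import reduce


def correct_crash(r, fusion_tuple_sets):
    """Recover the primary tuple from a partial tuple r.

    r uses None for each crashed primary; fusion_tuple_sets holds one
    tuple-set per available fusion state.
    """
    def contains(g):
        # g "contains" r when it agrees with r on every surviving primary.
        return all(x is None or x == y for x, y in zip(r, g))

    # D: for each available fusion state, the tuples consistent with r
    # (brute force here; the paper retrieves them through LSH tables).
    D = [{g for g in tuples if contains(g)} for tuples in fusion_tuple_sets]

    # Intersect starting from the smallest set, as in the analysis above.
    D.sort(key=len)
    return reduce(set.intersection, D)


# Tuple-sets of the current states of F1 and F2 (from the Section 5.2.2 example).
f1_0 = {(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)}
f2_0 = {(0, 0, 0), (1, 1, 1)}

# Hypothetical scenario: A and B have crashed, C reports state 0.
r = (None, None, 0)
recovered = correct_crash(r, [f1_0, f2_0])
print(recovered)
```

In this scenario the intersection is a singleton, so the states of the two crashed primaries are determined uniquely, which is the property Theorem 8 guarantees for an (f, f)-fusion.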

5.2.2 Byzantine Correction 

Given the primary tuple and the tuple-sets of the fusion states, the
correctByz algorithm in Fig. 5 corrects up to ⌊f/2⌋ Byzantine faults.
The algorithm finds the set of tuples among the tuple-sets of each
fusion state that are within Hamming distance ⌊f/2⌋ of the primary
tuple r using the LSH tables and stores them in list D. It then
constructs a vote vector V for each unique tuple in this list. The
number of votes for each tuple g ∈ V is the number of times it appears
in D plus the number of primary states of r that appear in g. The tuple
with at least n + ⌊f/2⌋ votes is the correct primary tuple. When there
is no such tuple, we need to exhaustively search each fusion state for
points within distance ⌊f/2⌋ of r (LSH has not returned all of them).
In Fig. 2, let the states of machines A, B, C, $F_1$ and $F_2$ be
$a^0$, $b^1$, $c^0$, $f_1^0$ and $f_2^0$ respectively, with one liar
among them (⌊f/2⌋ = 1). The tuples within Hamming distance one of
$r = a^0b^1c^0$ among
$f_1^0 = \{a^0b^0c^0, a^1b^0c^1, a^1b^1c^0, a^0b^1c^1\}$ and
$f_2^0 = \{a^0b^0c^0, a^1b^1c^1\}$ are
$\{a^0b^0c^0, a^1b^1c^0, a^0b^1c^1\}$ and $\{a^0b^0c^0\}$ respectively.
Here, tuple $a^0b^0c^0$ wins a vote each from $F_1$ and $F_2$ since
$a^0b^0c^0$ is present in $f_1^0$ and $f_2^0$. It also wins a vote each
from A and C, since the current states of A and C, $a^0$ and $c^0$, are
present in $a^0b^0c^0$. The algorithm returns $a^0b^0c^0$ as the true
primary tuple, since it receives n + ⌊f/2⌋ = 3 + 1 = 4 votes. We show
in the following theorem that the true primary tuple will always get
sufficient votes.
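The voting rule in this example can be checked with a short Python sketch. It is a simplified illustration under stated assumptions rather than the correctByz algorithm of the paper: the candidate list D is built by brute force instead of LSH, the exhaustive-search fallback is omitted, and the names `hamming` and `correct_byz` are hypothetical. Tuples are written as index triples, e.g. (0, 1, 0) for the reported primary tuple.

```python
from collections import Counter


def hamming(g, r):
    """Number of positions in which two tuples differ."""
    return sum(x != y for x, y in zip(g, r))


def correct_byz(r, fusion_tuple_sets, f):
    """Return the tuples with at least n + floor(f/2) votes, and all votes."""
    n, t = len(r), f // 2

    # D: per fusion state, the tuples within Hamming distance t of r.
    D = [{g for g in tuples if hamming(g, r) <= t} for tuples in fusion_tuple_sets]

    votes = Counter()
    for candidates in D:
        for g in candidates:
            votes[g] += 1                  # one vote per fusion state listing g
    for g in list(votes):
        votes[g] += n - hamming(g, r)      # one vote per primary state of r found in g

    winners = [g for g, v in votes.items() if v >= n + t]
    return winners, votes


# Tuple-sets of the current fusion states quoted in the example above.
f1_0 = {(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)}
f2_0 = {(0, 0, 0), (1, 1, 1)}

winners, votes = correct_byz((0, 1, 0), [f1_0, f2_0], f=2)
print(winners, dict(votes))
```

The tuple (0, 0, 0) collects two votes from the fusion states and two from the primaries A and C, reaching the threshold of four, while the other two candidates collect three votes each.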

Theorem 9 Given a set of n machines P and an (f, f)-fusion F
corresponding to it, the correctByz algorithm corrects up to ⌊f/2⌋
Byzantine faults among them.

Proof We prove that the true primary tuple, $r^c$, will uniquely get at
least n + ⌊f/2⌋ votes. Since there are at most ⌊f/2⌋ liars, $r^c$ will
be present in the tuple-sets of at least n + ⌊f/2⌋ machines. Hence the
number of votes for $r^c$, $V[r^c]$, is at least n + ⌊f/2⌋. An
incorrect primary tuple $r^w$ can get votes from at most ⌊f/2⌋ machines
(i.e., the liars) and from the truthful machines that contain both
$r^c$ and $r^w$ in the same tuple-set. Since F is an (f, f)-fusion of
P, among all the n + f machines, fewer than n of them contain
$\{r^c, r^w\}$ in the same tuple-set. Hence, the number of votes for
$r^w$, $V[r^w]$, is less than n + ⌊f/2⌋, and hence less than $V[r^c]$.

The space complexity analysis is similar to crash correction. The time
complexity to generate D, the same as that for crash fault correction,
is O(npf) w.h.p. If we maintain G as a hash table (standard hash
functions), to obtain votes from the fusions, we just need to iterate
through the f sets in D, each containing O(p) points of size n each,
and check for their presence in G in constant time. Hence the time
complexity to obtain votes from the backups is O(npf). Since the size
of G is O(pf), the time complexity to obtain votes from the primaries
is again O(npf), giving an overall time complexity of O(npf) w.h.p. In
the case of replication, we just need to obtain the majority across f
copies of each primary with time complexity O(nf). The message
complexity analysis is the same as for Byzantine detection, because
correction can take place only after acquiring the state of all
machines and detecting the fault.



6 Practical use of Fusion in the MapReduce Framework 

To motivate the practical use of fusion, we discuss its potential
application to the MapReduce framework, which is used to model large
scale distributed computations. Typically, the MapReduce framework is
built using the master-worker configuration, where the master assigns
the map and reduce tasks to various workers. While the map tasks
perform the actual computation on the data files received by them as
<key, value> pairs, the reducer tasks aggregate the results according
to the keys and write them to the output file.

Note that in batch-processing applications of MapReduce, fault
tolerance is based on passive replication. So, a task that failed would
simply be restarted on another worker node. However, our work is
targeted towards applications such as distributed stream processing,
with strict deadlines. Here, active replication is often used for fault
tolerance [27, 6]. Hence, tasks are replicated at the beginning of the
computation, to ensure that despite failures there are sufficient
workers remaining.

In this paper, we focus on the distributed grep application based on
the MapReduce framework. Given a continuous stream of data files, the
grep application checks if every line of the file matches patterns
defined by regular expressions (modeled as DFSMs). Specifically, we
assume that the expressions are ((0 + 1)(0 + 1))*, ((0 + 2)(0 + 2))*
and (00)*, modeled by A, B and C shown in Fig. 1. We show using a
simple case study that the current replication-based solution requires
1.8 million map tasks while our solution that combines fusion with
replication requires only 1.4 million map tasks. This results in
considerable savings in space and other computational resources.
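As a rough illustration of what the three primary map tasks compute, the expressions above can be written with Python's `re` module, with `+` in the paper's notation becoming alternation. This is a sketch under assumptions: a line is taken to match only if the whole line matches, the sample lines are made up, and `map_task` is a hypothetical helper, not part of any MapReduce API.

```python
import re

# The three expressions from the text, one per primary machine.
PATTERNS = {
    "A": re.compile(r"(?:[01][01])*"),   # ((0 + 1)(0 + 1))*
    "B": re.compile(r"(?:[02][02])*"),   # ((0 + 2)(0 + 2))*
    "C": re.compile(r"(?:00)*"),         # (00)*
}


def map_task(name, lines):
    """Return the lines fully matched by the expression of machine `name`."""
    return [line for line in lines if PATTERNS[name].fullmatch(line)]


# Made-up input lines for illustration only.
lines = ["0101", "0202", "0000", "011", "12"]
matches = {name: map_task(name, lines) for name in PATTERNS}
print(matches)
```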



[Figure 7 panels: (i) Replication: 18 map tasks; (ii) Hybrid: 14 map tasks]

Fig. 7 Replication vs. Fusion for grep using the MapReduce framework.



6.1 Existing Replication-based Solution

We first outline a simplified version of a pure replication-based
solution to correct two crash faults in Fig. 7(i). Given an input file
stream, the master splits the file into smaller partitions (or streams)
and breaks these partitions into <file name, file content> tuples. For
each partition, we maintain three primary map tasks $m_A$, $m_B$ and
$m_C$ that output the lines that match the regular expressions modeled
by A, B and C respectively. To correct two crash faults, we maintain
two additional copies of each primary map task for every partition. The
master sends tuples belonging to each partition to the primaries and
the copies. The reduce phase just collects all lines from these map
tasks and passes them to the user. Note that the reducer receives
inputs from the primaries and their copies and simply discards
duplicate inputs. Hence, the copies help in both fault tolerance and
load balancing.

When map tasks fail, the state of the failed tasks can be recovered
from one of the remaining copies. From Fig. 7(i), it is clear that each
file partition requires nine map tasks. In such systems, typically, the
input files are large enough to be partitioned into 200,000 partitions
[8]. Hence, replication requires 1.8 million map tasks.

6.2 Hybrid Fusion-based Solution 

In this section, we outline an alternate solution based on a
combination of replication and fusion, as shown in Fig. 7(ii). For each
partition, we maintain just one additional copy of each primary and
also maintain one fused map task, denoted $m_F$, for the entire set of
primaries. The fused map task searches for the regular expression (11)*
modeled by $F_1$ in Fig. 1. Clearly, this solution can correct two
crash faults among the primary map tasks, identical to the
replication-based solution. The reducer operation remains identical.
The output of the fused map task is relevant only for fault tolerance
and hence it does not send its output to the reducer. Note that since
there is only one additional copy of each primary, we compromise on
load balancing as compared to pure replication. However, we require
only seven map tasks as compared to the nine map tasks required by pure
replication.

When only one fault occurs among the map tasks, the state of the failed
map task can be recovered from the remaining copy with very little
overhead. Similarly, if two faults occur across the primary map tasks,
i.e., $m_A$ and $m_B$ fail, then their state can be recovered from the
remaining copies. Only in the relatively rare event that two faults
occur among the copies of the same primary does the fused map task have
to be used for recovery. For example, if both copies of $m_A$ fail,
then $m_F$ needs to acquire the state of $m_B$ and $m_C$ (any of the
copies) and perform the algorithm for crash correction in Section 5.2.1
to recover the state of $m_A$. Considering 200,000 partitions, the
hybrid approach needs only 1.4 million map tasks, which is 22% fewer
map tasks than replication, even for this simple example. Note that as
n increases, the savings in the number of map tasks increase even
further. This results in considerable savings in terms of (i) the state
space required by these map tasks and (ii) resources such as the power
consumed by them.
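The task counts quoted in Sections 6.1 and 6.2 follow from simple arithmetic, reproduced here as a quick check. The variable names are ours; the inputs (three primaries, two crash faults, 200,000 partitions) are the figures stated in the text.

```python
partitions = 200_000
primaries = 3
f = 2  # crash faults to be corrected

# Pure replication: every primary plus f copies of it, per partition.
replication_per_partition = primaries * (1 + f)   # 9 map tasks
# Hybrid: every primary, one copy of each, and one fused map task.
hybrid_per_partition = primaries * 2 + 1          # 7 map tasks

replication_tasks = replication_per_partition * partitions
hybrid_tasks = hybrid_per_partition * partitions
savings_percent = 100 * (replication_tasks - hybrid_tasks) / replication_tasks

print(replication_tasks, hybrid_tasks, round(savings_percent, 1))
```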

7 Experimental Evaluation 

In this section, we evaluate fusion using the MCNC'91 benchmarks [30]
for DFSMs, widely used for research in the fields of logic synthesis
and finite state machine synthesis [21, 31]. In Table 3, we specify the
number of states and number of events/inputs for the benchmark machines
presented in our results. We implemented an incremental version of the
genFusion algorithm (Appendix B) in Java 1.6 and compared the
performance of fusion with replication for 100 different combinations
of the benchmark machines, with n = 3, f = 2, Δe = 3, and present some
of the results in Table 4. The implementation and detailed results are
available in [3].

Table 3 MCNC'91 Benchmark Machines

Machines    States  Events
dk15        4       8
bbara       10      16
mc          4       8
lion        4       4
bbtas       6       4
tav         4       16
modulo12    12      2
beecount    7       8
shiftreg    8       2

Let the primaries be denoted $P_1$, $P_2$ and $P_3$ and the fused
backups $F_1$ and $F_2$. Column 1 of Table 4 specifies the names of the
three primary DFSMs. Column 2 specifies the backup space required for
replication, $(\prod_{i=1}^{3} |P_i|)^f$, column 3 specifies the backup
space for fusion, $\prod_{i=1}^{f} |F_i|$, and column 4 specifies the
percentage state space savings ((column 2 - column 3) * 100 / column 2).
Column 5 specifies the total number of primary events, column 6
specifies the average number of events across $F_1$ and $F_2$, and the
last column specifies the percentage reduction in events
((column 5 - column 6) * 100 / column 5).

For example, consider the first row of Table 4. The primary machines
are the ones named dk15, bbara and mc. Since the machines have 4, 10
and 4 states respectively (Table 3), the replication state space for
f = 2 is the state space for two additional copies of each of these
machines, which is $(4 \cdot 10 \cdot 4)^2 = 25600$. The two fusion
machines generated for this set of primary machines each had 140 states
and hence, the total state space for fusion as a solution is
$140^2 = 19600$. For the benchmark machines, the events are binary
inputs. For example, as seen in Table 3, dk15 contains eight events.
Hence, the event set of dk15 is {0, 1, ..., 7}. The event set of the
primaries is the union of the event sets of each primary. So, for the
first row of Table 4, the primary event set is {0, 1, ..., 15}. In this
example, both fusion machines had 10 events and hence, the average
number of fusion events is 10.
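The column definitions can be checked against this first row with a few lines of Python. The helper name `state_space_savings` is ours; the inputs are the state and event counts quoted in the text.

```python
from math import prod


def state_space_savings(primary_states, fusion_states, f):
    """Replication space, fusion space and percentage savings for one row."""
    replication = prod(primary_states) ** f   # f extra copies of every primary
    fusion = prod(fusion_states)              # one state space per fused backup
    return replication, fusion, 100 * (replication - fusion) / replication


# Row 1: dk15, bbara, mc with 4, 10 and 4 states; two 140-state fusions.
replication, fusion, savings = state_space_savings([4, 10, 4], [140, 140], f=2)

# Event reduction for the same row: 16 primary events, 10 fusion events on average.
event_reduction = 100 * (16 - 10) / 16

print(replication, fusion, round(savings, 2), event_reduction)
```

The same function applied to the second row (lion, bbtas, mc, with two 92-state fusions) gives 9216, 8464 and a saving of about 8.16%.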

The average state space savings in fusion (over replication) is 38%
(with range 0-99%) over the 100 combinations of benchmark machines,
while the average event reduction is 4% (with range 0-45%). We also
present results in [3] that show that the average savings in time by
the incremental approach for generating the fusions (over the
non-incremental approach) is 8%. Hence, fusion achieves significant
savings in space for standard benchmarks, while the event reduction
indicates that for many cases, the backups will not contain a large
number of events.

Table 4 Evaluation of Fusion on the MCNC'91 Benchmarks

Machines               Replication   Fusion        % Savings     Primary  Fusion  % Reduction
                       State Space   State Space   State Space   Events   Events  Events
dk15, bbara, mc        25600         19600         23.44         16       10      37.5
lion, bbtas, mc        9216          8464          8.16          8        7       12.5
lion, tav, modulo12    36864         9216          75            16       16      0
lion, bbara, mc        25600         25600         0             16       9       43.75
tav, beecount, lion    12544         10816         13.78         16       16      0
mc, bbtas, shiftreg    36864         26896         27.04         8        7       12.5
tav, bbara, mc         25600         25600         0             16       16      0
dk15, modulo12, mc     36864         28224         23.44         8        8       0
modulo12, lion, mc     36864         36864         0             8        7       12.5

8 Discussion: Backups Outside the Closed Partition Set 

So far in this paper, we have only considered machines that belong to
the closed partition set. In other words, given a set of primaries P,
our search for backup machines was restricted to those that are less
than the RCP of P, denoted by R. However, it is possible that efficient
backup machines exist outside the lattice, i.e., among machines that
are not less than or equal to R. In this section, we present a
technique to detect if a machine outside the closed partition set of R
can correct faults among the primaries. Given a set of machines F, each
less than or equal to R, we can determine if P ∪ F can correct faults
based on the $d_{min}$ of P ∪ F (Section 3.3). To find $d_{min}$, we
first determine the mapping between the states of R and the states of
each of the machines in F. However, given a set of machines Q that are
not less than or equal to R, how do we generate this mapping?

To determine the mapping between the states of R and the states of the
machines in Q, we first generate the RCP of {R} ∪ Q, denoted B, which
is greater than all the machines in {R} ∪ Q. Hence, we can determine
the mapping between the states of B and the states of all the machines
in {R} ∪ Q. Given this mapping, we can determine the (non-unique)
mapping between the states of R and the states of the machines in Q.
This enables us to determine $d_{min}(R, \{R\} \cup Q)$. If this
$d_{min}$ is greater than f, then Q can correct f crash or ⌊f/2⌋
Byzantine faults among the machines in P.
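Once every machine is expressed as a collection of sets of states of R, the minimum distance can be computed directly. The following Python sketch assumes that a machine separates two states of R when none of its states contains both of them, which also covers the non-unique (overlapping) mappings discussed above. The 4-state R and the machines A, B and G below are hypothetical and do not correspond to the machines in the figures.

```python
from itertools import combinations


def separates(machine, s, t):
    """A machine separates s and t if no single state of it contains both."""
    return not any(s in block and t in block for block in machine)


def d_min(states, machines):
    """Minimum, over all pairs of states, of the number of separating machines."""
    return min(
        sum(separates(m, s, t) for m in machines)
        for s, t in combinations(states, 2)
    )


# Hypothetical example: R has states 0..3; each machine is a list of sets of R-states.
R_states = range(4)
A = [{0, 1}, {2, 3}]
B = [{0, 2}, {1, 3}]
G = [{0, 3}, {1, 2}]

print(d_min(R_states, [A, B]))      # primaries alone
print(d_min(R_states, [A, B, G]))   # primaries together with the candidate backup G
```

With these hypothetical machines the distance rises from 1 to 2 when G is added, so under the criterion above G would correct one crash fault among A and B.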

Consider the example shown in Fig. 8. Given the set of primaries
{A, B, C} shown in Fig. 1, we want to determine if G can correct one
crash fault among {A, B, C}. Since G is outside the closed partition
set of R, we first construct B, which is the RCP of G and R. Since B is
greater than both R and G, we can determine how its states are mapped
to the states of R and G (similar to Fig. 2). For example, $b^0$ and
$b^s$ are mapped to $r^0$ in R, while $b^0$ and $b^9$ are mapped to
$g^0$ in G. Using this information, we can determine the mapping
between the states of R and G. For example, since $b^0$ and $b^9$ are
mapped to $r^0$ and $r^2$ respectively, $g^0 = \{r^0, r^2\}$. Extending
this idea, we get:




Fig. 8 Machine outside the closed partition set of R in Fig. 2



In Fig. 3(ii), the weakest edges of G({A, B, C}) are $(r^0, r^1)$ and
$(r^2, r^3)$ (the other weakest edges are not shown). Since G separates
all these edges, it can correct one crash fault among the machines in
{A, B, C}. However, note that the machines in {A, B, C} cannot correct
a fault in G. For example, if G
crashes and R is in state r°, we cannot determine if G was in 
state g° or g A . This is clearly different from the case of the 
fusion machines presented in this paper, where faults could 
be corrected among both primaries and backups. 

9 Related Work 

Our work in [5] introduces the concept of the fusion of DF- 
SMs, and presents an algorithm to generate a backup to cor- 
rect one crash fault among a given set of machines. This 
paper is based on our work in ER l. The work presented 
in [11, 2, 10] explores fault tolerance in distributed systems with
programs hosting large data structures. The key idea there is to use
erasure/error correcting codes [7] to reduce the space overhead of
replication. Even in this paper, we exploit the similarity between
fault tolerance in DFSMs and fault tolerance in a block of bits using
erasure codes in Section 3.3. However, there is one important
difference between erasure codes involving bits and the DFSM problem.
In erasure codes, the value of the redundant bits depends on the data
bits. In the case of DFSMs, it is not feasible to transmit the state of
all the machines after each event transition to calculate the state of
the backup machines. Further, recovery in such an approach is costly
due to the cost of decoding. In our solution, the backup machines act
on the same inputs as the original machines and independently
transition to suitable states. Extensive work has been done [16, 15] on
the minimization of completely specified DFSMs, but the minimized
machines are equivalent to the original machines. In our approach, we
reduce the RCP to generate efficient backup machines that are less than
the RCP. Finally, since we assume a trusted recovery agent, the work on
consensus in the presence of Byzantine faults [18, 23] does not apply
to our paper.

10 Conclusion 

We present a fusion-based solution to correct f crash or ⌊f/2⌋
Byzantine faults among n DFSMs using just f backups as
compared to the traditional approach of replication that re- 
quires nf backups. In tabled we summarize our results and 
compare the various parameters for replication and fusion. In 
this paper, we present a framework to understand fault tol- 
erance in machines and provide an algorithm that generates 
backups that are optimized for states as well as events. Fur- 
ther, we present algorithms for detection and the correction 
of faults with minimal overhead over replication. 

Our evaluation of fusion over standard benchmarks shows 
that efficient backups exist for many examples. To illustrate 
the practical use of fusion, we describe a fusion-based design 
of a distributed application in the MapReduce framework. 
While the current replication-based solution may require 1.8
million map tasks, a fusion-based solution requires just 1.4 
million map tasks with minimal overhead in terms of time 
as compared to replication. This can result in considerable 
savings in space and other computational resources such as 
power. 

In the future, we wish to implement the design presented in Section 6
using the Hadoop framework [29] and compare the end-to-end performance
of replication and our fusion-based solution. In particular, we wish to
focus on the space incurred by both solutions, and the time and
computation power taken for a set of tasks to complete with and without
faults.
Further, we wish to explore the existence of efficient back- 
ups if we allow information exchange among the primaries. 
Finally, we wish to design efficient algorithms to generate 
backups both inside and outside the closed partition set of 
the RCP. 

References 

1. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms
for approximate nearest neighbor in high dimensions. Commun. ACM,
51(1):117-122, 2008.

2. Bharath Balasubramanian and Vijay K. Garg. Fused data structures 
for handling multiple faults in distributed systems. In Proceedings 
of the 2011 31st International Conference on Distributed Comput- 
ing Systems, ICDCS '11, pages 677-688, Washington, DC, USA, 
2011. IEEE Computer Society.

3. Bharath Balasubramanian and Vijay K. Garg. Fused fsm design 
tool (implemented in java 1.6). In Parallel and Distributed Systems 
Laboratory, http://maple.ece.utexas.edu, 2011. 

4. Bharath Balasubramanian and Vijay K. Garg. Fused state ma- 
chines for fault tolerance in distributed systems. In Principles 
of Distributed Systems - 15th International Conference, OPODIS 
2011, Toulouse, France, December 13-16, 2011. Proceedings, vol- 
ume 7109 of Lecture Notes in Computer Science, pages 266-282. 
Springer, 2011. 

5. Bharath Balasubramanian, Vinit Ogale, and Vijay K. Garg. Fault 
tolerance in finite state machines using fusion. In Proceedings of In- 
ternational Conference on Distributed Computing and Networking 
(ICDCN) 2008, Kolkata, volume 4904 of Lecture Notes in Com- 
puter Science, pages 124-134. Springer, 2008. 

6. Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and 
Mike Stonebraker. Fault-Tolerance in the Borealis Distributed 
Stream Processing System. In ACM SIGMOD Conf, Baltimore, 
MD, June 2005. 

7. E. R. Berlekamp. Algebraic Coding Theory. McGraw-Hill, New 
York, 1968. 

8. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data 
processing on large clusters. Commun. ACM, 51:107-113, January 
2008. 

9. Xavier Defago, Andre Schiper, and Peter Urban. Total order broadcast
and multicast algorithms: Taxonomy and survey. ACM Comput. Surv.,
36(4):372-421, December 2004.

10. Vijay K. Garg. Implementing fault-tolerant services using state
machines: beyond replication. In Proceedings of the 24th international
conference on Distributed computing, DISC'10, pages 450-464, Berlin,
Heidelberg, 2010. Springer-Verlag.

11. Vijay K. Garg and Vinit Ogale. Fusible data structures for fault 
tolerance. In ICDCS 2007: Proceedings of the 27th International 
Conference on Distributed Computing Systems, June 2007. 

12. Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity 
search in high dimensions via hashing. In VLDB '99: Proceedings 
of the 25th International Conference on Very Large Data Bases, 
pages 518-529, San Francisco, CA, USA, 1999. Morgan Kaufmann 
Publishers Inc. 

13. Richard Hamming. Error-detecting and error-correcting codes. 
In Bell System Technical Journal, volume 29(2), pages 147-160, 
1950. 

14. J. Hartmanis and R. E. Stearns. Algebraic structure theory of se- 
quential machines (Prentice-Hall international series in applied 
mathematics). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 
1966. 

15. John E. Hopcroft. An n log n algorithm for minimizing states in a 
finite automaton. Technical report, Stanford, CA, USA, 1971. 

16. David A. Huffman. The synthesis of sequential switching circuits. 
Technical report, Massachusetts, USA, 1954. 

17. Leslie Lamport. The implementation of reliable distributed multi- 
process systems. Computer networks, 2:95-114, 1978. 

18. Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzan- 
tine generals problem. ACM Transactions on Programming Lan- 
guages and Systems, 4:382-401, 1982. 

19. David Lee and Mihalis Yannakakis. Closed partition lattice and
machine decomposition. IEEE Trans. Comput., 51(2):216-228, 2002.

20. P. M. Melliar-Smith, L. E. Moser, and V. Agrawala. Broadcast
protocols for distributed systems. IEEE Trans. Parallel Distrib. Syst.,
1(1):17-25, January 1990.

21. Alan Mishchenko, Satrajit Chatterjee, and Robert Brayton. Dag-aware
aig rewriting: A fresh look at combinational logic synthesis. In DAC
'06: Proceedings of the 43rd annual conference on Design automation,
pages 532-536. ACM Press, 2006.

22. Vinit Ogale, Bharath Balasubramanian, and Vijay K. Garg. A 
fusion-based approach for tolerating faults in finite state machines. 
In Proceedings of the 2009 IEEE International Symposium on Par- 
allel & Distributed Processing, IPDPS '09, pages 1-11, Washing- 
ton, DC, USA, 2009. IEEE Computer Society. 

23. M. Pease and L. Lamport. Reaching agreement in the presence of 
faults. Journal of the ACM, 27:228-234, 1980. 

24. Wesley W. Peterson and E. J. Weldon. Error-Correcting Codes - 
Revised, 2nd Edition. The MIT Press, 2 edition, March 1972. 

25. Fred B. Schneider. Byzantine generals in action: implementing fail- 
stop processors. ACM Trans. Comput. Syst., 2(2):145-154, 1984. 

26. Fred B. Schneider. Implementing fault-tolerant services using the 
state machine approach: A tutorial. ACM Computing Surveys, 
22(4):299-319, 1990. 

27. Mehul A. Shah, Joseph M. Hellerstein, and Eric Brewer. Highly 
available, fault-tolerant, parallel dataflows. In Proceedings of the 
2004 ACM SIGMOD International Conference on Management of 
Data, SIGMOD '04, pages 827-838, New York, NY, USA, 2004. 
ACM. 

28. Fathi Tenzakhti, Khaled Day, and M. Ould-Khaoua. Replication 
algorithms for the world-wide web. J. Syst. Archit., 50(10):591- 
605, 2004. 

29. Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 
1st edition, 2009. 

30. Saeyang Yang. Logic synthesis and optimization benchmarks user 
guide version 3.0, 1991. 

31. Hiroshi Youra, Tomoo Inoue, Toshimitsu Masuzawa, and Hideo
Fujiwara. On the synthesis of synchronizable finite state machines with
partial scan. Systems and Computers in Japan, 29(1):53-62, 1998.






A Event-Based Decomposition of Machines 




Fig. 11 Event-based decomposition of a machine.



In this section, we ask a question that is fundamental to the
understanding of DFSMs, independent of fault tolerance: Given a machine
M, can it be replaced by two or more machines executing in parallel,
each containing fewer events than M? In other words, given the state of
these fewer-event machines, can we uniquely determine the state of M?
In Fig. 11, the 2-event machine M (it contains events 0 and 1 in its
event set) checks for the parity of 0s and 1s. M can be replaced by two
1-event machines P and Q that check for the parity of just 1s or 0s
respectively. Given the state of P and Q, we can determine the state of
M. In this section, we explore the problem of replacing a given machine
M with two or more machines, each containing fewer events than M. We
present an algorithm to generate such event-reduced machines with time
complexity polynomial in the size of M. This is important for
applications with limits on the number of events each individual
process running a DFSM can service. We first define the notion of
event-based decomposition.

Definition 5 A (k, e)-event decomposition of a machine
$M(X_M, \alpha_M, \Sigma_M, m^0)$ is a set of k machines $\mathcal{E}$,
each less than M, such that $d_{min}(M, \mathcal{E}) > 0$ and
$\forall P(X_P, \alpha_P, \Sigma_P, p^0) \in \mathcal{E}$,
$|\Sigma_P| \le |\Sigma_M| - e$.

As $d_{min}(M, \mathcal{E}) > 0$, given the state of the machines in
$\mathcal{E}$, the state of M can be determined. So, the machines in
$\mathcal{E}$, each containing at most $|\Sigma_M| - e$ events, can
effectively replace M. In Fig. 12, we present the eventDecompose
algorithm that takes as input machine M and parameter e, and returns a
(k, e)-event decomposition of M (if it exists) for some
$k \le |X_M|^2$.

In each iteration, Loop 1 generates machines that contain at least one
event less than the machines of the previous iteration. So, starting
with M in the first iteration, at the end of e iterations,
$\mathcal{M}$ contains the set of largest machines less than M, each
containing at most $|\Sigma_M| - e$ events.

Loop 2 iterates through each machine P generated in the previous
iteration, and uses the reduceEvent algorithm (same as the algorithm
presented in Fig. 4) to generate the set of largest machines less than
P containing at least one event less than $\Sigma_P$. To generate a
machine less than P that does not contain an event $\sigma$ in its
event set, the reduceEvent algorithm combines the states such that they
loop onto themselves on $\sigma$. The algorithm then constructs the
largest machine that contains these states in the combined form. This
machine, in effect, ignores $\sigma$. This procedure is repeated for
all events in $\Sigma_P$ and the largest incomparable machines among
them are returned. Loop 3 constructs an event decomposition
$\mathcal{E}$ of M by iteratively adding at least one machine from
$\mathcal{M}$ to separate each pair of states in M, thereby ensuring
that $d_{min}(M, \mathcal{E}) > 0$. Since each machine added to
$\mathcal{E}$ can separate more than one pair of states, an efficient
way to implement Loop 3 is to check for the pairs that still need to be
separated in each iteration and add machines till no pair remains.
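Loop 3 can be sketched as a greedy selection over the candidate machines. The Python below is an illustration under assumptions, not the algorithm of Fig. 12: machines are represented as lists of blocks of states of M, the function names are hypothetical, and the example is the parity machine of Fig. 11, taken here to have four states (parity of 0s, parity of 1s).

```python
from itertools import combinations


def separates(machine, s, t):
    """A machine separates s and t if they lie in different blocks."""
    return not any(s in block and t in block for block in machine)


def loop3(states, candidates):
    """Pick machines until every pair of states of M is separated.

    Returns an empty list when some pair cannot be separated, i.e. when
    no event decomposition exists among the candidates.
    """
    chosen = []
    for s, t in combinations(states, 2):
        if any(separates(m, s, t) for m in chosen):
            continue                      # pair already separated
        sep = next((m for m in candidates if separates(m, s, t)), None)
        if sep is None:
            return []
        chosen.append(sep)
    return chosen


# States of M as (parity of 0s, parity of 1s); P tracks 1s, Q tracks 0s.
M_states = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = [{(0, 0), (1, 0)}, {(0, 1), (1, 1)}]
Q = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}]

decomposition = loop3(M_states, [P, Q])
print(len(decomposition))
```

Because already-separated pairs are skipped, at most one machine is added per pair, which mirrors the bound of $|X_M|^2$ on the size of the decomposition.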



Let the 4-event machine M shown in Fig. f^] be the input to the 
eventDecompose algorithm with e = 1. In the first and only iteration of
Loop 1, P = M and the reduceEvent algorithm generates the set of
largest 3-event machines less than M by successively eliminating each
event. To eliminate event 0, since $m^0$ transitions to $m^3$ on event
0, these two states are combined. This is repeated for all states, and
the largest machine containing all the combined states self-looping on
event 0 is $M_1$. Similarly, the largest machines not acting on events
3, 1 and 2 are $M_2$, $M_3$ and $M_4$ respectively. The reduceEvent
algorithm returns $M_1$ and $M_2$ as the only largest incomparable
machines in this set. The eventDecompose algorithm returns
$\mathcal{E} = \{M_1, M_2\}$, since each pair of states in M is
separated by $M_1$ or $M_2$. Hence, the 4-event M can be replaced by
the 3-event $M_1$ and $M_2$, i.e., $\mathcal{E} = \{M_1, M_2\}$ is a
(2, 1)-event decomposition of M.
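The event-elimination step described above can be illustrated in Python. This sketch makes two assumptions that should be read as such: the "largest machine" is taken to be the finest closed partition in which every state is merged with its successor on the eliminated event, and the example uses the 2-event parity machine M of Fig. 11 (assumed to have four states) rather than the 4-event machine of the worked example. The name `reduce_event` and the data layout are ours.

```python
def reduce_event(states, events, delta, sigma):
    """Finest closed partition of `states` whose blocks self-loop on `sigma`."""
    parent = {s: s for s in states}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    def union(s, t):
        rs, rt = find(s), find(t)
        if rs == rt:
            return False
        parent[rs] = rt
        return True

    # Step 1: combine each state with the state it moves to on sigma.
    for s in states:
        union(s, delta[s, sigma])

    # Step 2: close the partition, so that states sharing a block move
    # to a common block on every event.
    changed = True
    while changed:
        changed = False
        for s in states:
            for t in states:
                if find(s) == find(t):
                    for e in events:
                        if union(delta[s, e], delta[t, e]):
                            changed = True

    blocks = {}
    for s in states:
        blocks.setdefault(find(s), set()).add(s)
    return sorted(sorted(b) for b in blocks.values())


# M tracks (parity of 0s, parity of 1s); event "0" flips the first bit, "1" the second.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
events = ["0", "1"]
delta = {}
for p0, p1 in states:
    delta[(p0, p1), "0"] = (1 - p0, p1)
    delta[(p0, p1), "1"] = (p0, 1 - p1)

without_0 = reduce_event(states, events, delta, "0")
without_1 = reduce_event(states, events, delta, "1")
print(without_0, without_1)
```

Eliminating event 0 leaves a two-block machine that only tracks the parity of 1s (machine P of Fig. 11), and eliminating event 1 leaves the machine that only tracks the parity of 0s (machine Q).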

Theorem 1 Given machine $M(X_M, \alpha_M, \Sigma_M, m^0)$, the
eventDecompose algorithm generates a (k, e)-event decomposition of M
(if it exists) for some $k \le |X_M|^2$.

Proof The reduceEvent algorithm exhaustively generates the largest
incomparable machines that ignore at least one event in $\Sigma_M$.
After e such reductions in events, Loop 3 selects one machine (if it
exists) among $\mathcal{M}$ to separate each pair of states in $X_M$.
This ensures that at the end of Loop 3, either
$d_{min}(M, \mathcal{E}) > 0$ or the algorithm has returned {} (no
(k, e)-event decomposition exists). Since there are at most $|X_M|^2$
pairs of states in $X_M$, there are at most $|X_M|^2$ iterations of
Loop 3, in which we pick one machine per iteration. Hence,
$k \le |X_M|^2$.

The reduceEvent algorithm visits each state of machine M to create
blocks of states which loop to the same block on event
$\sigma \in \Sigma_M$. This has time complexity $O(|X_M|)$ per event.
The cost of generating the largest closed partition corresponding to
this block is $O(|X_M||\Sigma_M|)$ per event. Since we need to do this
for all events in $\Sigma_M$, the time complexity to reduce at least
one event is $O(|X_M||\Sigma_M|^2)$. In the eventDecompose algorithm,
the first iteration generates at most $|\Sigma_M|$ machines, the second
iteration at most $|\Sigma_M|^2$ machines, and the e-th iteration will
contain $O(|\Sigma_M|^e)$ machines. The rest of the analysis is similar
to the one presented in Section 4.2, and the time complexity of the
reduceEvent algorithm is $O(|X_M||\Sigma_M|^{e+1})$.

To generate the (k, e)-event decomposition from the set of machines in
$\mathcal{M}$, we find a machine in $\mathcal{M}$ to separate each pair
of states in $X_M$. Since there are $O(|X_M|^2)$ such pairs, the number
of iterations of Loop 3 is $O(|X_M|^2)$. In each iteration of Loop 3,
we find a machine among the $O(|\Sigma_M|^e)$ machines of $\mathcal{M}$
that separates a pair $m_i, m_j \in X_M$. To check if a machine
separates a pair of states just takes $O(|X_M|)$ time. Hence the time
complexity of Loop 3 is $O(|X_M|^3|\Sigma_M|^e)$. So, the overall time
complexity of the eventDecompose algorithm is the sum of the time
complexities of Loops 1 and 3, which is
$O(|X_M||\Sigma_M|^{e+1} + |X_M|^3|\Sigma_M|^e)$.



B Incremental Approach to Generate Fusions 

In Fig. 13, we present an incremental approach to generate the fusions,
referred to as the incFusion algorithm, in which we may never have to
reduce the RCP of all the primaries. In each iteration, we generate the
fusion corresponding to a new primary and the RCP of the (possibly
small) fusions generated for the set of primaries in the previous
iteration.

In Fig. 14, rather than generate a fusion by reducing the 8-state RCP
of {A, B, C}, we can reduce the 4-state RCP of {A, B} to generate fusion
F' and then reduce the 4-state RCP of {C, F'} to generate fusion F.
In the following paragraph, we present the proof of correctness for the
incremental approach and show that its time complexity is $O(\rho^n)$ times
better than that of the genFusion algorithm, where $\rho$ is the average state
reduction achieved by fusion.
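To make the state counts in this example concrete, the following Python sketch computes the reachable cross product (RCP) of small parity machines by breadth-first search. It is an illustration under stated assumptions, not the paper's implementation: the helper names `parity_machine` and `rcp` are hypothetical, machine A is assumed to track the parity of 1s, and events outside a machine's transition table are treated as self-loops.

```python
from collections import deque

def parity_machine(toggling_events):
    """Two-state machine whose state flips on each event in toggling_events."""
    transitions = {(s, ev): 1 - s for s in (0, 1) for ev in toggling_events}
    return 0, transitions  # (initial state, transition table)

def rcp(machines, events):
    """Return the set of reachable states of the cross product of machines."""
    start = tuple(init for init, _ in machines)
    reachable = {start}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        for ev in events:
            # Events missing from a machine's table act as self-loops.
            nxt = tuple(trans.get((s, ev), s)
                        for s, (_, trans) in zip(state, machines))
            if nxt not in reachable:
                reachable.add(nxt)
                frontier.append(nxt)
    return reachable

EVENTS = (0, 1, 2)
A = parity_machine({1})      # assumption: parity of 1s
B = parity_machine({1, 2})   # parity of 1s and 2s
C = parity_machine({0})      # parity of 0s

rcp_ab = rcp([A, B], EVENTS)
rcp_abc = rcp([A, B, C], EVENTS)
print(len(rcp_ab), len(rcp_abc))  # 4 8
```

Under these assumptions the RCP of {A, B} has 4 states and the RCP of {A, B, C} has 8, matching the sizes quoted above. The incremental approach only ever reduces the smaller product.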

Theorem 2 Given a set of n machines $\mathcal{P}$, the incFusion algorithm
generates an (f, f)-fusion of $\mathcal{P}$.

incFusion

Input: Primaries 𝒫 = {P_1, P_2, ..., P_n}, faults f,
    state-reduction parameter Δs, event-reduction parameter Δe;
Output: (f, f)-fusion of 𝒫;

ℱ ← {P_1};
for (i = 2 to n)
    𝒩 ← {P_i} ∪ RCP(ℱ);
    ℱ ← genFusion(𝒩, f, Δs, Δe);
return ℱ;

Fig. 13 Incremental fusion algorithm.

Proof We prove the theorem using induction on the variable i in the
algorithm. For the base case, i.e., i = 2, $\mathcal{N} = \{P_1, P_2\}$ (since $RCP(\{P_1\}) = P_1$).
Let the (f, f)-fusion generated by the genFusion algorithm for
$\mathcal{N} = \{P_1, P_2\}$ be denoted $\mathcal{F}^1$. For i = 3, let the (f, f)-fusion
generated for $\mathcal{N} = \{P_3, RCP(\mathcal{F}^1)\}$ be denoted $\mathcal{F}^2$. We show that $\mathcal{F}^2$ is an
(f, f)-fusion of $\{P_1, P_2, P_3\}$. Assume f crash faults among $\{P_1, P_2, P_3\} \cup \mathcal{F}^2$.
Clearly, at most f machines in $\{P_3\} \cup \mathcal{F}^2$ have crashed. Since
$\mathcal{F}^2$ is an (f, f)-fusion of $\{P_3, RCP(\mathcal{F}^1)\}$, we can generate the state of all
the machines in $RCP(\mathcal{F}^1)$ and the state of the crashed machines among
$\{P_3\} \cup \mathcal{F}^2$. Similarly, at most f machines have crashed
among $\{P_1, P_2\}$. Hence, using the state of the available machines among
$\{P_1, P_2\}$ and the states of all the machines in $\mathcal{F}^1$, we can generate the
state of the crashed machines among $\{P_1, P_2\}$.

Induction Hypothesis: Assume that the set of machines $\mathcal{F}^i$,
generated in iteration i, is an (f, f)-fusion of $\{P_1 \ldots P_{i+1}\}$. Let the (f, f)-
fusion of $\{P_{i+2}, RCP(\mathcal{F}^i)\}$ generated in iteration i + 1 be denoted $\mathcal{F}^{i+1}$.
To prove: $\mathcal{F}^{i+1}$ is an (f, f)-fusion of $\{P_1 \ldots P_{i+2}\}$. The proof is similar
to that for the base case. Using the state of the available machines in
$\{P_{i+2}\} \cup \mathcal{F}^{i+1}$, we can generate the state of all the machines in $\mathcal{F}^i$ and







[Diagram label: $M_\bot$ (self-loops on all events)]



eventDecompose

Input: Machine M with state set X_M, event set Σ_M
    and transition function α_M;
Output: (k, e)-event decomposition of M for
    some k ≤ |X_M|^2;

ℳ = {M};
for (j = 1 to e) //Loop 1
    𝒢 = {};
    for (P ∈ ℳ) //Loop 2
        𝒢 = 𝒢 ∪ reduceEvent(P);
    ℳ = 𝒢;
ℰ = {};
for (m_i, m_j ∈ X_M) //Loop 3
    if (∃E ∈ ℳ : E separates m_i, m_j)
        ℰ ← ℰ ∪ {E};
    else
        return {};
return ℰ;

reduceEvent

Input: Machine P with state set X_P, event set Σ_P
    and transition function α_P;
Output: Largest machines < P with ≤ |Σ_P| - 1 events;

ℬ = {};
for (σ ∈ Σ_P)
    Set of states, X_B = X_P;
    //combine states to self-loop on σ
    for (s ∈ X_B)
        s = s ∪ α_P(s, σ);
    ℬ = ℬ ∪ {Largest machine consistent with X_B};
return largest incomparable machines in ℬ;

Fig. 12 Algorithm for the event-based decomposition of a machine.




[Figure panels: B (Parity of 1s, 2s); C (Parity of 0s)]

Fig. 14 Incremental Approach: first generate F' and then F.



$\{P_{i+2}\} \cup \mathcal{F}^{i+1}$. Subsequently, we can generate the state of the crashed
machines in $\{P_1 \ldots P_{i+1}\}$.
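The control flow analysed in this proof can be sketched in Python as follows. This is only a skeleton under loud assumptions: `inc_fusion` takes the fusion generator and the RCP construction as parameters, and the toy stand-ins `toy_rcp` and `toy_gen_fusion` are hypothetical placeholders that merely track which primaries a machine covers. They perform no state reduction, are not the genFusion algorithm, and omit the Δs and Δe parameters.

```python
def inc_fusion(primaries, f, gen_fusion, rcp):
    """Fold the primaries in one at a time, as in the incFusion loop."""
    fusion = [primaries[0]]              # RCP({P_1}) is P_1 itself
    for primary in primaries[1:]:
        n = [primary, rcp(fusion)]       # N <- {P_i} U RCP(F)
        fusion = gen_fusion(n, f)        # F <- genFusion(N, f, ...)
    return fusion

# Toy stand-ins (assumptions, not the paper's algorithms): a "machine" is
# the frozenset of primary names whose state it can reproduce.
def toy_rcp(machines):
    return frozenset().union(*machines)

def toy_gen_fusion(n, f):
    # f backups, each covering everything in N (no reduction attempted).
    return [toy_rcp(n)] * f

primaries = [frozenset({"P1"}), frozenset({"P2"}), frozenset({"P3"})]
result = inc_fusion(primaries, f=2, gen_fusion=toy_gen_fusion, rcp=toy_rcp)
```

Each call to the generator sees only two inputs, the new primary and the RCP of the previous fusions, yet the final backups cover all three primaries, which mirrors the inductive argument.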

From Observation 1, the genFusion algorithm has time complexity
$O(fN^4|\Sigma| + fN^5)$ (assuming Δs = 0 and Δe = 0 for simplicity). Hence,
if the size of $\mathcal{N}$ in the i-th iteration of the incFusion algorithm is denoted
by $N_i$, then the time complexity of the incFusion algorithm, $T_{inc}$, is given
by the expression $\sum_{i=1}^{n-1} O(fN_i^4|\Sigma| + fN_i^5)$.

Let the number of states in each primary be s. For i = 2, the
primaries are $\{P_1, P_2\}$ and $N_1 = O(s^2)$. For i = 3, the primaries are
$\{RCP(\mathcal{F}^1), P_3\}$. Note that $RCP(\mathcal{F}^1)$ is also a fusion machine. Since we
assume an average reduction of $\rho$ (size of RCP of primaries / average
size of each fusion), the number of states in $RCP(\mathcal{F}^1)$ is $O(s^2/\rho)$. So,
$N_2 = O(s^3/\rho)$. Similarly, $N_3 = O(s^4/\rho^2)$ and $N_i = O(s^{i+1}/\rho^{i-1})$. So,

$$T_{inc} = O\Big(|\Sigma| f \sum_{i=1}^{n-1} s^{4i+4}/\rho^{4i-4} + f \sum_{i=1}^{n-1} s^{5i+5}/\rho^{5i-5}\Big)$$

$$= O\Big(|\Sigma| f s^4 \rho^4 \sum_{i=1}^{n-1} (s/\rho)^{4i} + f s^5 \rho^5 \sum_{i=1}^{n-1} (s/\rho)^{5i}\Big)$$

This is the sum of a geometric progression and hence,

$$T_{inc} = O\big(|\Sigma| f s^4 \rho^4 (s/\rho)^{4n} + f s^5 \rho^5 (s/\rho)^{5n}\big)$$



Assuming $\rho$ and s are constants, $T_{inc} = O(f|\Sigma|s^n/\rho^n + fs^n/\rho^n)$. Note that
the time complexity of the genFusion algorithm in Fig.^ is $O(f|\Sigma|s^n + fs^n)$.
Hence, the incFusion algorithm achieves $O(\rho^n)$ savings in time
complexity over the genFusion algorithm.
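As a rough numeric illustration of this comparison, the sketch below evaluates the cost model for small, arbitrarily chosen parameters. It makes several assumptions: the asymptotic bound $fN^4|\Sigma| + fN^5$ is treated as an exact operation count, $N_i = s^{i+1}/\rho^{i-1}$ is taken with equality, and the non-incremental run is assumed to operate on an RCP of $s^n$ states. The function names and parameter values are hypothetical.

```python
def inc_sizes(s, rho, n):
    """Sizes N_i = s^(i+1) / rho^(i-1) for iterations i = 1 .. n-1."""
    return [s ** (i + 1) / rho ** (i - 1) for i in range(1, n)]

def gen_cost(N, f, sigma):
    """Cost model f*N^4*|Sigma| + f*N^5, used here as an exact count."""
    return f * (N ** 4 * sigma + N ** 5)

# Illustrative values only: 4 primaries of 4 states each, reduction 2.
s, rho, n, f, sigma = 4, 2, 4, 2, 3

sizes = inc_sizes(s, rho, n)
inc_cost = sum(gen_cost(N, f, sigma) for N in sizes)

# Non-incremental run on the full RCP, assumed to have s^n states.
direct_cost = gen_cost(s ** n, f, sigma)

print(sizes)                   # [16.0, 32.0, 64.0]
print(inc_cost < direct_cost)  # True
```

With these values the largest machine handled incrementally has 64 states, against 256 for the full RCP, so the incremental total is far smaller under this cost model.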



